Molecular prognostic signature for predicting breast cancer metastasis, and uses thereof

ABSTRACT

The present invention is based on the discovery of a unique 14-gene molecular prognostic signature that is useful for predicting breast cancer metastasis. In particular, the present invention relates to methods and reagents for detecting and profiling the expression levels of these genes, and methods of using the expression level information in predicting risk of breast cancer metastasis.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation application of U.S. application Ser. No. 14/083,755, filed on Nov. 19, 2013, which is a divisional of U.S. application Ser. No. 12/638,040, filed on Dec. 15, 2009, which is a divisional of U.S. application Ser. No. 12/012,530, filed on Jan. 31, 2008 (and issued as U.S. Pat. No. 7,695,915 on Apr. 13, 2010), which claims the benefit of U.S. provisional application Ser. No. 60/898,963, filed on Jan. 31, 2007, the content of each of which is hereby incorporated by reference in its entirety into this application.

FIELD OF THE INVENTION

The present invention relates to prognosis of breast cancer metastasis. In particular, the present invention relates to a multi-gene prognostic signature that is useful in predicting risk of metastasis of a breast cancer patient's node-negative estrogen receptor (ER)-positive tumor. The multi-gene prognostic signature comprises 14 genes, whose mRNA in a breast cancer patient's ER-positive tumor can be obtained from formalin-fixed, paraffin-embedded (FFPE) tissue sections, and their expression levels measured by methods known in the art. Thus, the present invention is amenable for use in routine clinical laboratory testing for assessing the risk of distant metastasis of node-negative ER-positive breast cancer.

BACKGROUND OF THE INVENTION

Breast cancer is a complex and heterogeneous disease. Early detection of breast cancer improves the chances of successful treatment and recovery. Routine screening mammography has increased the detection of Stage I breast cancers and correspondingly, many more women are being diagnosed with lymph node-negative tumors. (B. Cady, 1997, Surg Oncol Clin N Am 6:195-202). About 43% of the approximately 240,000 women in the United States diagnosed with breast cancer each year are node-negative.

Based on the current guidelines, 85-90% of node-negative patients are candidates for systemic adjuvant therapy after surgery. Such systemic adjuvant therapy may include chemotherapy and hormonal therapy. However, about 60-70% of women with node-negative breast cancer who receive local treatment (mastectomy or lumpectomy and radiation) will not experience distant recurrence. Treatment decisions for breast cancer patients benefit from the assessment of each patient's risk for metastasis and response to treatment using multiple clinical and histopathological parameters.

Several recent studies have used microarrays to demonstrate that a patient's gene expression profile can also provide useful prognostic information. A subset of these studies has received focused attention due to their size and the extent of their validation. (L J van't Veer, H. Dai et al., 2002, Nature 415:530-536; M J van de Vijver, Y D He et al., 2002, N Engl J Med 347:1999-2009; Y. Wang, J G Klijn et al., 2005, Lancet 365:671-679; H. Dai, L J van't Veer et al., 2005, Cancer Res 15:4059-4066; and H Y Chang, D S Nuyten et al., 2005, Proc Natl Acad Sci USA 102:3738-3743).

The resulting confidence garnered for the 70-gene prognostic signature identified by van't Veer, Dai et al. (2002, Nature 415:530-536) has led to its incorporation into a European trial, the Microarray for Node-Negative Disease May Avoid Chemotherapy (MINDACT). Likewise, the PCR-based, 21-gene predictive signature described by SP Paik, S. Shak et al. (2004, N Engl J Med 351:2817-2826) has been included in a phase III trial by The Breast Cancer Intergroup of North America (Program for the Assessment of Clinical Cancer Tests (PACCT)). (V G Kaklamani and W J Gradishar, 2006, Curr Treat Options Oncol 7:123-8).

The 21-gene predictive signature (including 5 normalization genes) by SP Paik (2004, N Engl J Med 351:2817-2826) was derived from Tamoxifen-treated patients. The independence of that signature has drawn concern due to its substantial overlap with genes and/or proteins already used in conventional immunohistochemistry (IHC) tests. (D R Carrizosa and L A Carey, 2005, The American Journal of Oncology Review 4:7-10). The standard hormonal therapy for ER-positive breast cancer patients is changing from Tamoxifen alone, to sequential use of Tamoxifen plus aromatase inhibitors, or aromatase inhibitors alone. (E P Winer, C. Hudis et al., 2005, J Clin Oncol 23: 619-629; S M Swain, 2005, N Engl J Med 353:2807-9). A prognostic tool that is independent of Tamoxifen treatment can be important in providing a measure of the baseline risk for patients who plan on taking aromatase inhibitors.

Thus, there is a need for a gene-based prognostic assay that can be used for routine clinical laboratory testing in predicting the risk of distant metastasis in breast cancer patients. Ideally, the assay would require the measurement of expression levels of a relatively small number of genes, and the mRNA encoded by such genes can be readily obtained from tumor tissues preserved by routine collection methods such as FFPE tumor sections. Information on the risk for distant metastasis can be used in guiding treatment strategies for breast cancer patients, particularly early stage lymph node-negative patients, such that patients who are at higher risk of distant metastasis are treated properly, and patients who are at lower risk of distant metastasis may be spared the side effects of certain treatments.

SUMMARY OF THE INVENTION

The present invention relates to a 14-gene signature for predicting risk of metastasis of ER-positive tumors in breast cancer patients. The invention is based, in part, on studies of early stage, lymph node-negative, ER-positive patients who most need additional information to guide therapeutic decisions following primary diagnosis. The fourteen genes in the molecular signature of the present invention are disclosed in Table 2. One skilled in the art can perform expression profiling on the 14 genes described herein, using RNA obtained from a number of possible sources, and then insert the expression data into the provided algorithm to determine a prognostic metastasis score.

In one aspect, the invention relates to a method of determining risk associated with tumor metastasis in a breast cancer patient, comprising measuring mRNA expression of the genes known as CENPA, PKMYT1, MELK, MYBL2, BUB1, RACGAP1, TK1, UBE2S, DC13, RFC4, PRR11, DIAPH3, ORC6L and CCNB1 in estrogen receptor-positive tumor cells of the breast cancer patient, and predicting risk of tumor metastasis based on mRNA expression levels of said genes.

In another aspect, the invention relates to a method of determining risk associated with tumor metastasis in a breast cancer patient, comprising measuring the expression level of genes CENPA, PKMYT1, MELK, MYBL2, BUB1, RACGAP1, TK1, UBE2S, DC13, RFC4, PRR11, DIAPH3, ORC6L and CCNB1 in estrogen receptor-positive tumor cells of said breast cancer patient, thereby obtaining a metastasis score (MS) based upon the expression levels of said genes, and determining risk of tumor metastasis for said breast cancer patient by comparing said metastasis score to a predefined metastasis score cut point (MS Threshold).

In a further aspect of the invention, the breast cancer patient is determined to have an increased risk of tumor metastasis if the patient's MS is higher than the predefined MS Threshold.

In another aspect of the invention, the breast cancer patient is determined to have a decreased risk of tumor metastasis if the patient's MS is lower than the predefined MS Threshold.

In one aspect, the invention relates to a method of determining risk associated with tumor metastasis in a breast cancer patient, in which mRNA of the 14-gene signature is obtained from ER-positive tumor cells, reverse transcribed to cDNA, and detected by polymerase chain reaction amplification.

In another aspect, the invention relates to a method of determining risk associated with tumor metastasis in a breast cancer patient, in which mRNA of ER-positive tumor cells is reverse transcribed and amplified by the two primers associated with each gene as presented in Table 3, SEQ ID NOS. 1-34.

In another aspect, the invention relates to a method of determining risk associated with tumor metastasis in a breast cancer patient, in which measurements of mRNA expression from ER-positive tumor cells are normalized against the mRNA expression of any one of the genes known as NUP214, PPIG and SLU7, or a combination thereof, as endogenous control(s).

In another aspect, the invention relates to a method of determining risk associated with tumor metastasis in a breast cancer patient, in which mRNA expression from ER-positive tumor cells is detected by a microarray.

In another aspect, the invention relates to a method of determining risk associated with tumor metastasis in a breast cancer patient, in which the mRNA expression is computed into a metastasis score (MS) by the following:

${MS} = {{a\; 0} + {\sum\limits_{i = 1}^{M}\; {{ai}*{Gi}}}}$

where M=14, Gi=the standardized expression level of each gene (i) of the fourteen said genes, a0=0.022, and ai corresponds to the value presented in Table 2 for each of the genes in the 14-gene signature.

In another aspect, the invention relates to a method of determining risk associated with tumor metastasis in a breast cancer patient, in which the mRNA expression is computed into a metastasis score (MS) by the following:

$MS = a_0 + b \cdot \left[ \sum_{i=1}^{M} a_i \cdot G_i \right]$  (Equation 1)

where M=14, Gi=the standardized expression level of each gene (i) of the fourteen said genes, a0=0.022, b=−0.251 and ai corresponds to the value presented in Table 2 for each of the genes in the 14-gene signature. The standardized expression level is obtained by subtracting the mean expression of that gene in the training set from the expression level measured in Δ(ΔCt) and then dividing by the standard deviation of the expression of that gene in the training set. The mean and standard deviation of gene expression for each gene in the training set are presented in Table 4. Equation 1 was used in Examples 1, 2 and 3.
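The standardization step and Equation 1 can be illustrated with a minimal Python sketch. The constants a0=0.022 and b=−0.251 are taken from the text above; the per-gene coefficients ai (Table 2) and the training-set means and standard deviations (Table 4) are not reproduced in this section, so the values used here are hypothetical placeholders, as are the function and variable names.

```python
# Minimal sketch of Equation 1. The coefficient, mean and SD values below
# are HYPOTHETICAL placeholders, not the values of Table 2 or Table 4.
GENES = ["CENPA", "PKMYT1", "MELK", "MYBL2", "BUB1", "RACGAP1", "TK1",
         "UBE2S", "DC13", "RFC4", "PRR11", "DIAPH3", "ORC6L", "CCNB1"]

A0 = 0.022   # a0, as stated for Equation 1
B = -0.251   # b, as stated for Equation 1


def standardize(ddct, mean, sd):
    """Subtract the training-set mean from a delta(delta Ct) value, divide by the SD."""
    return (ddct - mean) / sd


def metastasis_score_eq1(ddct, coef, mean, sd, a0=A0, b=B):
    """Return MS = a0 + b * sum(ai * Gi) over the 14 genes (Equation 1)."""
    weighted_sum = sum(
        coef[g] * standardize(ddct[g], mean[g], sd[g]) for g in GENES
    )
    return a0 + b * weighted_sum


# Placeholder inputs (assumptions for illustration only).
coef = {g: 1.0 for g in GENES}   # stand-in for the ai values of Table 2
mean = {g: 0.0 for g in GENES}   # stand-in for the training-set means of Table 4
sd = {g: 1.0 for g in GENES}     # stand-in for the training-set SDs of Table 4

if __name__ == "__main__":
    sample = {g: 0.5 for g in GENES}  # made-up delta(delta Ct) values
    print(metastasis_score_eq1(sample, coef, mean, sd))
```

Keeping the tables as dictionaries keyed by gene name means the real Table 2 and Table 4 values could be substituted without changing the scoring function.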

In a further aspect of the invention, the MS formula can assume the following definition:

$MS = a_0 + b \cdot \left[ \sum_{i=1}^{M} a_i \cdot G_i \right]$  (Equation 2)

where M=14, Gi=expression level measured in Δ(ΔCt) of each gene (i) of the fourteen said genes, a0=0.8657, b=−0.04778, and ai=1 for all genes. Equation 2 was used in Examples 4 and 5.

In a further aspect of the invention, the MS formula can have a0=0, b=−1 and ai=1.
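Because ai=1 for every gene in Equation 2, the score reduces to a constant plus a scaled sum of the fourteen Δ(ΔCt) values. The following Python sketch shows that arithmetic using the constants stated above (a0=0.8657, b=−0.04778) as defaults; the function name and the input values are illustrative assumptions.

```python
def metastasis_score_unweighted(ddct_values, a0=0.8657, b=-0.04778):
    """Return MS = a0 + b * sum(Gi) with ai = 1 for all 14 genes (Equation 2).

    ddct_values: the 14 delta(delta Ct) expression values, one per gene.
    """
    values = list(ddct_values)
    if len(values) != 14:
        raise ValueError("expected 14 gene expression values")
    return a0 + b * sum(values)


if __name__ == "__main__":
    example = [1.0] * 14  # made-up expression values for illustration
    print(metastasis_score_unweighted(example))               # Equation 2 constants
    print(metastasis_score_unweighted(example, a0=0.0, b=-1.0))  # a0=0, b=-1 variant
```

Passing a0=0 and b=−1 gives the further variant mentioned above, in which the score is simply the negated sum of the expression values.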

In another aspect, the invention relates to a method of determining risk associated with tumor metastasis in a breast cancer patient using expression profiling of the 14 genes mentioned above, in which the expression level of each gene (i) is computed into a gene expression value Gi by the following:

$\Delta(\Delta Ct) = (Ct_{GOI} - Ct_{EC})_{\text{test RNA}} - (Ct_{GOI} - Ct_{EC})_{\text{ref RNA}}$  (Equation 4)

where Ct is the PCR threshold cycle of exponential target amplification, GOI=gene of interest, EC=endogenous control, test RNA=patient sample RNA, and ref RNA=reference RNA.
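As a worked illustration of Equation 4, the Python sketch below computes Δ(ΔCt) from four threshold-cycle readings. The Ct numbers are invented for the example. Averaging the Ct values of several endogenous controls is only one possible way to combine them and is an assumption of this sketch; the text states only that any one of the control genes, or a combination, may be used.

```python
from statistics import mean


def endogenous_control_ct(control_cts):
    """Combine endogenous-control Ct values by simple averaging (an assumption)."""
    return mean(control_cts)


def delta_delta_ct(ct_goi_test, ct_ec_test, ct_goi_ref, ct_ec_ref):
    """Equation 4: (Ct_GOI - Ct_EC) in test RNA minus (Ct_GOI - Ct_EC) in ref RNA."""
    return (ct_goi_test - ct_ec_test) - (ct_goi_ref - ct_ec_ref)


if __name__ == "__main__":
    # Invented Ct readings, for illustration only.
    ec_test = endogenous_control_ct([21.0, 22.0, 23.0])  # e.g. three control genes
    ec_ref = 21.0
    print(delta_delta_ct(25.0, ec_test, 27.0, ec_ref))
```

With these readings the test sample gives 25 − 22 = 3 and the reference gives 27 − 21 = 6, so Δ(ΔCt) = 3 − 6 = −3.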

In another aspect, the invention relates to a method of determining risk associated with tumor metastasis in a breast cancer patient using expression profiling of the 14 genes mentioned above, in which the expression levels Gi of the genes are combined into a single MS value, wherein a patient with an MS higher than the relevant MS Threshold or cut point would be at a higher risk for tumor metastasis.
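The comparison against the cut point can be sketched in Python as follows. The threshold value is supplied by the caller and is not specified here. The source states only the "higher than" and "lower than" cases; assigning a score exactly equal to the threshold to the lower-risk group is an assumption of this sketch, as are the function name and the group labels.

```python
def risk_group(ms, ms_threshold):
    """Return "high" if MS exceeds the cut point, otherwise "low".

    A score exactly equal to the threshold is treated as "low" here;
    that tie-breaking rule is an assumption, not stated in the source.
    """
    return "high" if ms > ms_threshold else "low"


if __name__ == "__main__":
    # Made-up scores compared against a hypothetical cut point of 0.0.
    for score in (-0.3, 0.0, 0.3):
        print(score, risk_group(score, 0.0))
```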

In one aspect, the invention relates to a kit comprising reagents for the detection of the expression levels of genes CENPA, PKMYT1, MELK, MYBL2, BUB1, RACGAP1, TK1, UBE2S, DC13, RFC4, PRR11, DIAPH3, ORC6L and CCNB1; an enzyme; and a buffer.

In another aspect, the invention relates to a microarray comprising polynucleotides hybridizing to genes CENPA, PKMYT1, MELK, MYBL2, BUB1, RACGAP1, TK1, UBE2S, DC13, RFC4, PRR11, DIAPH3, ORC6L and CCNB1.

SEQUENCE LISTING

The attached Sequence Listing is herein incorporated by reference in its entirety. The Sequence Listing provides the oligonucleotide sequences (SEQ ID NOS: 1-34) as shown in Table 3. These oligonucleotides are exemplary primers in the RT-PCR amplification of the genes listed in Table 3.

BRIEF DESCRIPTIONS OF THE FIGURES

FIG. 1 shows Kaplan-Meier curves for a) time to distant metastases and b) overall survival for the training set from CPMC, where high-risk and low-risk groups were defined by MS(CV) using zero as cut point.

FIG. 2 is a ROC curve for predicting distant metastases by MS(CV) in 5 years in the training set from CPMC. AUC=0.76 (0.65-0.87).

FIG. 3 shows Kaplan-Meier curves by risk groups defined by the gene signature and Adjuvant! in 280 untreated patients from Guy's Hospital. Specifically, FIGS. 3 a) and b) describe results using the 14-gene signature, and FIGS. 3 c) and d) describe results using the Adjuvant! factors. a) Time to distant metastases (DMFS) by MS risk groups b) Overall survival by MS risk groups c) Time to distant metastases (DMFS) by Adjuvant! risk groups d) Overall survival by Adjuvant! risk groups.

FIG. 4 shows receiver operating characteristic (ROC) curves of the gene signature and of the online program Adjuvant! a) ROC curve for distant metastases within 5 years for the gene signature b) ROC curve for distant metastases within 10 years for the gene signature c) ROC curve for death within 10 years for the gene signature d) ROC curve for metastases within 5 years for Adjuvant! e) ROC curve for metastases within 10 years for Adjuvant! f) ROC curve for death within 10 years for Adjuvant! for untreated patients from Guy's Hospital.

FIG. 5 shows probability of distant metastasis within 5 years and 10 years vs. Metastasis Score (MS) from 280 Guy's untreated patients.

FIG. 6 is a comparison of probability of distant metastasis in 10 years from the 14-gene signature vs. 10-year relapse probability from Adjuvant! for untreated patients from Guy's Hospital.

FIG. 7 shows Kaplan-Meier curves for distant-metastasis-free survival in University of Muenster patients.

FIG. 8 shows Kaplan-Meier curves of distant-metastasis-free survival in 3 MS groups (high, intermediate, low) for 205 treated patients from Guy's Hospital.

FIG. 9 shows Kaplan-Meier curves of distant-metastasis-free survival in 2 risk groups (high and low) determined by MS for 205 treated patients from Guy's Hospital.

FIG. 10 shows a ROC curve of MS to predict distant metastasis in 5 years for Guy's treated samples. AUC=0.7 (0.57-0.87).

FIG. 11 shows time dependence of hazard ratios of high vs. low risk groups by MS in Guy's treated samples.

FIG. 12 shows Kaplan-Meier curves of distant-metastasis-free survival (DMFS) for three MS groups (high, intermediate and low) in 234 Japanese samples.

FIG. 13 shows Kaplan-Meier curves of distant-metastasis-free survival (DMFS) for two risk groups (high MS have high risk whereas intermediate and low MS have low risk) in 234 Japanese samples.

FIG. 14 shows a ROC curve of MS to predict distant metastasis in 5 years for Japanese patients. AUC=0.73 (0.63-0.84).

FIG. 15 shows annualized hazard rate for MS groups and hazard ratio of high vs. low risk groups as a function of time.

DETAILED DESCRIPTION OF THE INVENTION

The present invention provides a multi-gene signature that can be used for predicting breast cancer metastasis, methods and reagents for the detection of the genes disclosed herein, and assays or kits that utilize such reagents. The breast cancer metastasis-associated genes disclosed herein are useful for diagnosing, screening for, and evaluating probability of distant metastasis of ER-positive tumors in breast cancer patients.

Expression profiling of the 14 genes of the molecular signature disclosed in Table 2 allows for prognosis of distant metastasis to be readily inferred. The information provided in Table 2 includes a reference sequence (RefSeq), obtained from the National Center for Biotechnology Information (NCBI) of the National Institutes of Health/National Library of Medicine, which identifies one variant transcript sequence of each described gene. Based on the sequence of the variant, reagents may be designed to detect all variants of each gene of the 14-gene signature. Table 3 provides exemplary primer sets that can be used to detect each gene of the 14-gene signature in a manner such that all variants of each gene are amplified. Thus, the present invention provides for expression profiling of all known transcript variants of all genes disclosed herein.

Also shown in Table 2 is the reference that publishes the nucleotide sequence of each RefSeq. These references are all herein incorporated by reference in their entirety. Also in Table 2 is a description of each gene. Both references and descriptions were provided by NCBI.

The CENPA gene is identified by reference sequence NM_001809 and disclosed in Black, B. E., Foltz, D. R., et al., 2004, Nature 430(6999):578-582. Said reference sequence and reference are herein incorporated by reference in their entirety.

The PKMYT1 gene is identified by reference sequence NM_004203 and disclosed in Bryan, B. A., Dyson, O. F. et al., 2006, J. Gen. Virol. 87 (PT 3), 519-529. Said reference sequence and reference are herein incorporated by reference in their entirety.

The MELK gene is identified by reference sequence NM_014791 and disclosed in Beullens, M., Vancauwenbergh, S. et al., 2005, J. Biol. Chem. 280 (48), 40003-40011. Said reference sequence and reference are herein incorporated by reference in their entirety.

The MYBL2 gene is identified by reference sequence NM_002466 and disclosed in Bryan, B. A., Dyson, O. F. et al., 2006, J. Gen. Virol. 87 (PT 3), 519-529. Said reference sequence and reference are herein incorporated by reference in their entirety.

The BUB1 gene is identified by reference sequence NM_004366 and disclosed in Morrow, C. J., Tighe, A. et al., 2005, J. Cell. Sci. 118 (PT 16), 3639-3652. Said reference sequence and reference are herein incorporated by reference in their entirety.

The RACGAP1 gene is identified by reference sequence NM_013277 and disclosed in Niiya, F., Xie, X. et al., 2005, J. Biol. Chem. 280 (43), 36502-36509. Said reference sequence and reference are herein incorporated by reference in their entirety.

The TK1 gene is identified by reference sequence NM_003258 and disclosed in Karbownik, M., Brzezianska, E. et al., 2005, Cancer Lett. 225 (2), 267-273. Said reference sequence and reference are herein incorporated by reference in their entirety.

The UBE2S gene is identified by reference sequence NM_014501 and disclosed in Liu, Z., Diaz, L. A. et al., 1992, J. Biol. Chem. 267 (22), 15829-15835. Said reference sequence and reference are herein incorporated by reference in their entirety.

The DC13 gene is identified by reference sequence AF201935 and disclosed in Gu, Y., Peng, Y. et al., Direct Submission, Submitted Nov. 5, 1999, Chinese National Human Genome Center at Shanghai, 351 Guo Shoujing Road, Zhangjiang Hi-Tech Park, Pudong, Shanghai 201203, P. R. China. Said reference sequence and reference are herein incorporated by reference in their entirety.

The RFC4 gene is identified by reference sequence NM_002916 and disclosed in Gupte, R. S., Weng, Y. et al., 2005, Cell Cycle 4 (2), 323-329. Said reference sequence and reference are herein incorporated by reference in their entirety.

The PRR11 gene is identified by reference sequence NM_018304 and disclosed in Weinmann, A. S., Yan, P. S. et al., 2002, Genes Dev. 16 (2), 235-244. Said reference sequence and reference are herein incorporated by reference in their entirety.

The DIAPH3 gene is identified by reference sequence NM_030932 and disclosed in Katoh, M. and Katoh, M., 2004, Int. J. Mol. Med. 13 (3), 473-478. Said reference sequence and reference are herein incorporated by reference in their entirety.

The ORC6L gene is identified by reference sequence NM_014321 and disclosed in Sibani, S., Price, G. B. et al., 2005, Biochemistry 44 (21), 7885-7896. Said reference sequence and reference are herein incorporated by reference in their entirety.

The CCNB1 gene is identified by reference sequence NM_031966 and disclosed in Zhao, M., Kim, Y. T. et al., 2006, Exp Oncol 28 (1), 44-48. Said reference sequence and reference are herein incorporated by reference in their entirety.

The PPIG gene is identified by reference sequence NM_004792 and disclosed in Lin, C. L., Leu, S. et al., 2004, Biochem. Biophys. Res. Commun. 321 (3), 638-647. Said reference sequence and reference are herein incorporated by reference in their entirety.

The NUP214 gene is identified by reference sequence NM_005085 and disclosed in Graux, C., Cools, J. et al., 2004, Nat. Genet. 36 (10), 1084-1089. Said reference sequence and reference are herein incorporated by reference in their entirety.

The SLU7 gene is identified by reference sequence NM_006425 and disclosed in Shomron, N., Alberstein, M. et al., 2005, J. Cell. Sci. 118 (PT 6), 1151-1159. Said reference sequence and reference are herein incorporated by reference in their entirety.

Also shown in Table 2 is the value for the constant ai required for determining the metastasis score for each gene i, based on the expression profiling results obtained for that gene. The derivation of the metastasis score and its use, and methods of gene expression profiling and use of the data obtained therefrom, are described below.

Thus, the present invention provides 14 individual genes which together are prognostic for breast cancer metastasis, methods of determining expression levels of these genes in a test sample, methods of determining the probability of an individual developing distant metastasis, and methods of using the disclosed genes to select a treatment strategy.

The present invention provides a unique 14-gene signature that was not previously known in the art. Accordingly, the present invention provides novel methods based on the genes disclosed herein, and also provides novel methods of using the known, but previously unassociated, genes in methods relating to breast cancer metastasis (e.g., for prognosis of breast cancer metastasis).

Those skilled in the art will readily recognize that nucleic acid molecules may be double-stranded molecules and that reference to a particular sequence of one strand refers, as well, to the corresponding site on a complementary strand. In defining a nucleotide sequence, reference to an adenine, a thymine (uridine), a cytosine, or a guanine at a particular site on one strand of a nucleic acid molecule also defines the thymine (uridine), adenine, guanine, or cytosine (respectively) at the corresponding site on a complementary strand of the nucleic acid molecule. Thus, reference may be made to either strand in order to refer to a particular nucleotide sequence. Probes and primers may be designed to hybridize to either strand and gene expression profiling methods disclosed herein may generally target either strand.
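The base-pairing correspondence described above (A with T, C with G) can be shown with a short Python sketch that derives the complementary strand of a DNA sequence, read 5′ to 3′. The function name and the example sequences are illustrative and are not taken from the Sequence Listing; RNA (uridine) is not handled in this minimal version.

```python
# Translation table for Watson-Crick DNA base pairing (upper and lower case).
_DNA_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the complementary DNA strand, written 5' to 3'."""
    return seq.translate(_DNA_COMPLEMENT)[::-1]


if __name__ == "__main__":
    # Arbitrary example sequences, not from the Sequence Listing.
    print(reverse_complement("ATGC"))
    print(reverse_complement("AAGCTT"))
```

A sequence such as AAGCTT is its own reverse complement, which illustrates why a reference to one strand also fixes the sequence at the corresponding site on the other.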

Tumor Tissue Source and RNA Extraction

In the present invention, target polynucleotide molecules are extracted from a sample taken from an individual afflicted with breast cancer. The sample may be collected in any clinically acceptable manner, but must be collected such that gene-specific polynucleotides (i.e., transcript RNA, or message) are preserved. The mRNA or nucleic acids so obtained from the sample may then be analyzed further. For example, pairs of oligonucleotides specific for a gene (e.g., the genes presented in Table 2) may be used to amplify the specific mRNA(s) in the sample. The amount of each message can then be determined, or profiled, and the correlation with a disease prognosis can be made. Alternatively, mRNA or nucleic acids derived therefrom (i.e., cDNA, amplified DNA or enriched RNA) may be labeled distinguishably from standard or control polynucleotide molecules, and both may be simultaneously or independently hybridized to a microarray comprising some or all of the markers or marker sets or subsets described above. Alternatively, mRNA or nucleic acids derived therefrom may be labeled with the same label as the standard or control polynucleotide molecules, wherein the intensity of hybridization of each at a particular probe is compared.

A sample may comprise any clinically relevant tissue sample, such as a formalin-fixed paraffin-embedded sample, frozen sample, tumor biopsy or fine needle aspirate, or a sample of bodily fluid containing ER-positive tumor cells such as blood, plasma, serum, lymph, ascitic or cystic fluid, urine, or nipple exudate.

Methods for preparing total and poly(A)+ RNA are well known and are described generally in Sambrook et al., MOLECULAR CLONING: A LABORATORY MANUAL (2ND ED.), Vols. 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor, N.Y. (1989) and Ausubel et al., Current Protocols in Molecular Biology vol. 2, Current Protocols Publishing, New York (1994). RNA may be isolated from ER-positive tumor cells by any procedures well-known in the art, generally involving lysis of the cells and denaturation of the proteins contained therein.

As an example of preparing RNA from tissue samples, RNA may also be isolated from formalin-fixed paraffin-embedded (FFPE) tissues using techniques well known in the art. Commercial kits for this purpose may be obtained, e.g., from Zymo Research, Ambion, Qiagen, or Stratagene. An exemplary method of isolating total RNA from FFPE tissue, according to the method of the Pinpoint Slide RNA Isolation System (Zymo Research, Orange, Calif.) is as follows. Briefly, the solution obtained from the Zymo kit is applied over the selected FFPE tissue region of interest and allowed to dry. The embedded tissue is then removed from the slide and placed in a centrifuge tube with proteinase K. The tissue is incubated for several hours, then the cell lysate is centrifuged and the supernatant transferred to another tube. RNA is extracted from the lysate by means of a guanidinium thiocyanate/β-mercaptoethanol solution, to which ethanol is added and mixed. The sample is applied to a spin column and spun for one minute. The column is washed with buffer containing ethanol and Tris/EDTA. DNase is added to the column and incubated. RNA is eluted from the column by adding heated RNase-free water to the column and centrifuging. Pure total RNA is present in the eluate.

Additional steps may be employed to remove contaminating DNA, such as the addition of DNase to the spin column, described above. Cell lysis may be accomplished with a nonionic detergent, followed by micro-centrifugation to remove the nuclei and hence the bulk of the cellular DNA. In one embodiment, RNA is extracted from cells of the various types of interest by cell lysis in the presence of guanidinium thiocyanate, followed by CsCl centrifugation to separate the RNA from DNA (Chirgwin et al., Biochemistry 18:5294-5299 (1979)). Poly(A)+ RNA is selected with oligo-dT cellulose (see Sambrook et al., MOLECULAR CLONING: A LABORATORY MANUAL (2ND ED.), Vols. 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor, N.Y. (1989)). Alternatively, separation of RNA from DNA can be accomplished by organic extraction, for example, with hot phenol or phenol/chloroform/isoamyl alcohol.

If desired, RNase inhibitors may be added to the lysis buffer. Likewise, for certain cell types it may be desirable to add a protein denaturation/digestion step to the protocol.

For many applications, it is desirable to preferentially enrich mRNA with respect to other cellular RNAs extracted from cells, such as transfer RNA (tRNA) and ribosomal RNA (rRNA). Most mRNAs contain poly(A) tails at their 3′ ends. This allows for enrichment by affinity chromatography; for example, using oligo(dT) or poly(U) coupled to a solid support, such as cellulose or Sephadex™ (see Ausubel et al., CURRENT PROTOCOLS IN MOLECULAR BIOLOGY, vol. 2, Current Protocols Publishing, New York (1994)). After being bound in this manner, poly(A)+ mRNA is eluted from the affinity column using 2 mM EDTA/0.1% SDS.

The sample of RNA can comprise a plurality of different mRNA molecules, each mRNA molecule having a different nucleotide sequence. In a specific embodiment, the mRNA molecules of the RNA sample comprise mRNA corresponding to each of the fourteen genes disclosed herein. In a further specific embodiment, total RNA or mRNA from cells is used in the methods of the invention. The source of the RNA can be any ER-positive tumor cell. In specific embodiments, the method of the invention is used with a sample containing total mRNA or total RNA from 1×10⁶ cells or fewer.

Reagents for Measuring Gene Expression

The present invention provides nucleic acid molecules that can be used in gene expression profiling and in determining prognosis of breast cancer metastasis. Exemplary nucleic acid molecules that can be used as primers in gene expression profiling of the 14-gene signature described herein are shown in Table 3.

As indicated in Table 3:

Gene BUB1 is reverse-transcribed and amplified with SEQ ID NO: 1 as the Upper primer (5′), and SEQ ID NO: 2 as the Lower primer (3′).

Gene CCNB1 is reverse-transcribed and amplified with SEQ ID NO: 3 as the Upper primer (5′), and SEQ ID NO: 4 as the Lower primer (3′).

Gene CENPA is reverse-transcribed and amplified with SEQ ID NO: 5 as the Upper primer (5′), and SEQ ID NO: 6 as the Lower primer (3′).

Gene DC13 is reverse-transcribed and amplified with SEQ ID NO: 7 as the Upper primer (5′), and SEQ ID NO: 8 as the Lower primer (3′).

Gene DIAPH3 is reverse-transcribed and amplified with SEQ ID NO: 9 as the Upper primer (5′), and SEQ ID NO: 10 as the Lower primer (3′).

Gene MELK is reverse-transcribed and amplified with SEQ ID NO: 11 as the Upper primer (5′), and SEQ ID NO: 12 as the Lower primer (3′).

Gene MYBL2 is reverse-transcribed and amplified with SEQ ID NO: 13 as the Upper primer (5′), and SEQ ID NO: 14 as the Lower primer (3′).

Gene NUP214 is reverse-transcribed and amplified with SEQ ID NO: 29 as the Upper primer (5′), and SEQ ID NO: 30 as the Lower primer (3′).

Gene ORC6L is reverse-transcribed and amplified with SEQ ID NO: 15 as the Upper primer (5′), and SEQ ID NO: 16 as the Lower primer (3′).

Gene PKMYT1 is reverse-transcribed and amplified with SEQ ID NO: 17 as the Upper primer (5′), and SEQ ID NO: 18 as the Lower primer (3′).

Gene PPIG is reverse-transcribed and amplified with SEQ ID NO: 31 as the Upper primer (5′), and SEQ ID NO: 32 as the Lower primer (3′).

Gene PRR11 is reverse-transcribed and amplified with SEQ ID NO: 19 as the Upper primer (5′), and SEQ ID NO: 20 as the Lower primer (3′).

Gene RACGAP1 is reverse-transcribed and amplified with SEQ ID NO: 21 as the Upper primer (5′), and SEQ ID NO: 22 as the Lower primer (3′).

Gene RFC4 is reverse-transcribed and amplified with SEQ ID NO: 23 as the Upper primer (5′), and SEQ ID NO: 24 as the Lower primer (3′).

Gene SLU7 is reverse-transcribed and amplified with SEQ ID NO: 33 as the Upper primer (5′), and SEQ ID NO: 34 as the Lower primer (3′).

Gene TK1 is reverse-transcribed and amplified with SEQ ID NO: 25 as the Upper primer (5′), and SEQ ID NO: 26 as the Lower primer (3′).

Gene UBE2S is reverse-transcribed and amplified with SEQ ID NO: 27 as the Upper primer (5′), and SEQ ID NO: 28 as the Lower primer (3′).

Based on the complete nucleotide sequence of each gene as shown in Table2, one skilled in the art can readily design and synthesize additionalprimers and/or probes that can be used in the amplification and/ordetection of the 14-gene signature described herein.

In a specific aspect of the present invention, the sequences disclosedin Table 3 can be used as gene expression profiling reagents. As usedherein, a “gene expression profiling reagent” is a reagent that isspecifically useful in the process of amplifying and/or detecting thenucleotide sequence of a specific target gene, whether that sequence ismRNA or cDNA, of the genes described herein. For example, the profilingreagent preferably can differentiate between different alternative genenucleotide sequences, thereby allowing the identity and quantificationof the nucleotide sequence to be determined. Typically, such a profilingreagent hybridizes to a target nucleic acid molecule by complementarybase-pairing in a sequence-specific manner, and discriminates the targetsequence from other nucleic acid sequences such as an art-known form ina test sample. An example of a detection reagent is a probe thathybridizes to a target nucleic acid containing a nucleotide sequencesubstantially complementary to one of the sequences provided in Table 3.In a preferred embodiment, such a probe can differentiate betweennucleic acids of different genes. Another example of a detection reagentis a primer which acts as an initiation point of nucleotide extensionalong a complementary strand of a target polynucleotide, as in reversetranscription or PCR. The sequence information provided herein is alsouseful, for example, for designing primers to reverse transcribe and/oramplify (e.g., using PCR) any gene of the present invention.

In one preferred embodiment of the invention, a detection reagent is an isolated or synthetic DNA or RNA polynucleotide probe or primer or PNA oligomer, or a combination of DNA, RNA and/or PNA, that hybridizes to a segment of a target nucleic acid molecule corresponding to any of the genes disclosed in Table 2. A detection reagent in the form of a polynucleotide may optionally contain modified base analogs, intercalators or minor groove binders. Multiple detection reagents such as probes may be, for example, affixed to a solid support (e.g., arrays or beads) or supplied in solution (e.g., probe/primer sets for enzymatic reactions such as PCR, RT-PCR, TaqMan assays, or primer-extension reactions) to form an expression profiling kit.

A probe or primer typically is a substantially purified oligonucleotide or PNA oligomer. Such oligonucleotide typically comprises a region of complementary nucleotide sequence that hybridizes under stringent conditions to at least about 8, 10, 12, 16, 18, 20, 22, 25, 30, 40, 50, 55, 60, 65, 70, 80, 90, 100, 120 (or any other number in-between) or more consecutive nucleotides in a target nucleic acid molecule.

Other preferred primer and probe sequences can readily be determined using the nucleotide sequences of genes disclosed in Table 2. It will be apparent to one of skill in the art that such primers and probes are directly useful as reagents for expression profiling of the genes of the present invention, and can be incorporated into any kit/system format.

In order to produce a probe or primer specific for a target gene sequence, the gene/transcript sequence is typically examined using a computer algorithm which starts at the 5′ or at the 3′ end of the nucleotide sequence. Typical algorithms will then identify oligomers of defined length that are unique to the gene sequence, have a GC content within a range suitable for hybridization, lack predicted secondary structure that may interfere with hybridization, and/or possess other desired characteristics or that lack other undesired characteristics.
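By way of illustration only, the oligomer screening described above may be sketched as follows. This is a minimal sketch under stated assumptions, not the algorithm actually used: the function names `gc_fraction` and `candidate_oligos`, the 20-nucleotide window, and the 40-60% GC range are hypothetical choices made for the example, and the secondary-structure check is omitted entirely.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def candidate_oligos(target, other_genes, length=20, gc_min=0.40, gc_max=0.60):
    """Scan the target from its 5' end and return (start, oligo) pairs that pass the filters.

    length, gc_min and gc_max are illustrative defaults (assumptions),
    not values taken from this document.
    """
    target = target.upper()
    other_genes = [s.upper() for s in other_genes]
    hits = []
    for start in range(len(target) - length + 1):
        oligo = target[start:start + length]
        # Filter 1: GC content within the chosen range.
        if not gc_min <= gc_fraction(oligo) <= gc_max:
            continue
        # Filter 2: the oligo occurs exactly once in the target itself.
        if target.find(oligo) != start or target.find(oligo, start + 1) != -1:
            continue
        # Filter 3: the oligo does not occur in any other supplied sequence.
        if any(oligo in s for s in other_genes):
            continue
        hits.append((start, oligo))
    return hits
```

In practice, a screen of this kind would be followed by further checks (for example, melting temperature and predicted secondary structure) before a primer or probe is selected.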

A primer or probe of the present invention is typically at least about 8 nucleotides in length. In one embodiment of the invention, a primer or a probe is at least about 10 nucleotides in length. In a preferred embodiment, a primer or a probe is at least about 12 nucleotides in length. In a more preferred embodiment, a primer or probe is at least about 16, 17, 18, 19, 20, 21, 22, 23, 24 or 25 nucleotides in length. While the maximal length of a probe can be as long as the target sequence to be detected, depending on the type of assay in which it is employed, it is typically less than about 50, 60, 65, or 70 nucleotides in length. In the case of a primer, it is typically less than about 30 nucleotides in length. In a specific preferred embodiment of the invention, a primer or a probe is within the length of about 18 and about 28 nucleotides. However, in other embodiments, such as nucleic acid arrays and other embodiments in which probes are affixed to a substrate, the probes can be longer, such as on the order of 30-70, 75, 80, 90, 100, or more nucleotides in length.

The present invention encompasses nucleic acid analogs that contain modified, synthetic, or non-naturally occurring nucleotides or structural elements or other alternative/modified nucleic acid chemistries known in the art. Such nucleic acid analogs are useful, for example, as detection reagents (e.g., primers/probes) for detecting one or more of the genes identified in Table 2. Furthermore, kits/systems (such as beads, arrays, etc.) that include these analogs are also encompassed by the present invention. For example, PNA oligomers that are based on the polymorphic sequences of the present invention are specifically contemplated. PNA oligomers are analogs of DNA in which the phosphate backbone is replaced with a peptide-like backbone (Lagriffoul et al., Bioorganic & Medicinal Chemistry Letters 4:1081-1082 [1994], Petersen et al., Bioorganic & Medicinal Chemistry Letters 6:793-796 [1996], Kumar et al., Organic Letters 3[9]:1269-1272 [2001], WO96/04000). PNA hybridizes to complementary RNA or DNA with higher affinity and specificity than conventional oligonucleotides and oligonucleotide analogs. The properties of PNA enable novel molecular biology and biochemistry applications unachievable with traditional oligonucleotides and peptides.

Additional examples of nucleic acid modifications that improve the binding properties and/or stability of a nucleic acid include the use of base analogs such as inosine, intercalators (U.S. Pat. No. 4,835,263) such as ethidium bromide and SYBR® Green, and the minor groove binders (U.S. Pat. No. 5,801,115). Thus, references herein to nucleic acid molecules, expression profiling reagents (e.g., probes and primers), and oligonucleotides/polynucleotides include PNA oligomers and other nucleic acid analogs. Other examples of nucleic acid analogs and alternative/modified nucleic acid chemistries known in the art are described in Current Protocols in Nucleic Acid Chemistry, John Wiley & Sons, New York (2002).

While the design of each allele-specific primer or probe depends on variables such as the precise composition of the nucleotide sequences in a target nucleic acid molecule and the length of the primer or probe, another factor in the use of primers and probes is the stringency of the conditions under which the hybridization between the probe or primer and the target sequence is performed. Higher stringency conditions utilize buffers with lower ionic strength and/or a higher reaction temperature, and tend to require a closer match between the probe/primer and target sequence in order to form a stable duplex. If the stringency is too high, however, hybridization may not occur at all. In contrast, lower stringency conditions utilize buffers with higher ionic strength and/or a lower reaction temperature, and permit the formation of stable duplexes with more mismatched bases between a probe/primer and a target sequence. By way of example but not limitation, exemplary conditions for high-stringency hybridization using an allele-specific probe are as follows: prehybridization with a solution containing 5× standard saline phosphate EDTA (SSPE), 0.5% NaDodSO₄ (SDS) at 55° C., and incubating probe with target nucleic acid molecules in the same solution at the same temperature, followed by washing with a solution containing 2×SSPE, and 0.1% SDS at 55° C. or room temperature.

Moderate-stringency hybridization conditions may be used for primer extension reactions with a solution containing, e.g., about 50 mM KCl at about 46° C. Alternatively, the reaction may be carried out at an elevated temperature such as 60° C. In another embodiment, a moderately-stringent hybridization condition is suitable for oligonucleotide ligation assay (OLA) reactions, wherein two probes are ligated if they are completely complementary to the target sequence, and may utilize a solution of about 100 mM KCl at a temperature of 46° C.

In a hybridization-based assay, specific probes can be designed that hybridize to a segment of target DNA of one gene sequence but do not hybridize to sequences from other genes. Hybridization conditions should be sufficiently stringent that there is a significant detectable difference in hybridization intensity between genes, and preferably an essentially binary response, whereby a probe hybridizes to only one of the gene sequences or significantly more strongly to one gene sequence. While a probe may be designed to hybridize to a target sequence of a specific gene such that the target site aligns anywhere along the sequence of the probe, the probe is preferably designed to hybridize to a segment of the target sequence such that the gene sequence aligns with a central position of the probe (e.g., a position within the probe that is at least three nucleotides from either end of the probe). This design of probe generally achieves good discrimination in hybridization between different genes.

Oligonucleotide probes and primers may be prepared by methods well known in the art. Chemical synthetic methods include, but are not limited to, the phosphotriester method described by Narang et al., Methods in Enzymology 68:90 [1979]; the phosphodiester method described by Brown et al., Methods in Enzymology 68:109 [1979]; the diethylphosphoamidate method described by Beaucage et al., Tetrahedron Letters 22:1859 [1981]; and the solid support method described in U.S. Pat. No. 4,458,066. In the case of an array, multiple probes can be immobilized on the same support for simultaneous analysis of multiple different gene sequences.

In one type of PCR-based assay, a gene-specific primer hybridizes to a region on a target nucleic acid molecule that overlaps a gene sequence and only primes amplification of the gene sequence to which the primer exhibits perfect complementarity (Gibbs, Nucleic Acid Res. 17:2427-2448 [1989]). Typically, the primer's 3′-most nucleotide is aligned with and complementary to the target nucleic acid molecule. This primer is used in conjunction with a second primer that hybridizes at a distal site. Amplification proceeds from the two primers, producing a detectable product that indicates which gene/transcript is present in the test sample. This PCR-based assay can be utilized as part of the TaqMan assay, described below.

The genes in the 14-gene signature described herein can be detected by any one of a variety of nucleic acid amplification methods, which are used to increase the copy numbers of a polynucleotide of interest in a nucleic acid sample. Such amplification methods are well known in the art, and they include but are not limited to, polymerase chain reaction (PCR) (U.S. Pat. Nos. 4,683,195 and 4,683,202; PCR Technology: Principles and Applications for DNA Amplification, ed. H. A. Erlich, Freeman Press, New York, N.Y. [1992]), ligase chain reaction (LCR) (Wu and Wallace, Genomics 4:560 [1989]; Landegren et al., Science 241:1077 [1988]), strand displacement amplification (SDA) (U.S. Pat. Nos. 5,270,184 and 5,422,252), transcription-mediated amplification (TMA) (U.S. Pat. No. 5,399,491), linked linear amplification (LLA) (U.S. Pat. No. 6,027,923), and the like, and isothermal amplification methods such as nucleic acid sequence based amplification (NASBA), and self-sustained sequence replication (Guatelli et al., Proc. Natl. Acad. Sci. USA 87:1874 [1990]). Based on such methodologies, a person skilled in the art can readily design primers in any suitable regions 5′ and 3′ of the gene sequences of interest, so as to amplify the genes disclosed herein. Such primers may be used to reverse-transcribe and amplify DNA of any length, such that it contains the gene of interest in its sequence.

Generally, an amplified polynucleotide is at least about 16 nucleotides in length. More typically, an amplified polynucleotide is at least about 20 nucleotides in length. In a preferred embodiment of the invention, an amplified polynucleotide is at least about 30 nucleotides in length. In a more preferred embodiment of the invention, an amplified polynucleotide is at least about 32, 40, 45, 50, or 60 nucleotides in length. In yet another preferred embodiment of the invention, an amplified polynucleotide is at least about 100, 200, 300, 400, or 500 nucleotides in length. While the total length of an amplified polynucleotide of the invention can be as long as an exon, an intron or the entire gene, an amplified product is typically up to about 1,000 nucleotides in length (although certain amplification methods may generate amplified products greater than 1,000 nucleotides in length). More preferably, an amplified polynucleotide is not greater than about 150-250 nucleotides in length.

In an embodiment of the invention, a gene expression profiling reagent of the invention is labeled with a fluorogenic reporter dye that emits a detectable signal. While the preferred reporter dye is a fluorescent dye, any reporter dye that can be attached to a detection reagent such as an oligonucleotide probe or primer is suitable for use in the invention. Such dyes include, but are not limited to, Acridine, AMCA, BODIPY, Cascade Blue, Cy2, Cy3, Cy5, Cy7, Dabcyl, Edans, Eosin, Erythrosin, Fluorescein, 6-Fam, Tet, Joe, Hex, Oregon Green, Rhodamine, Rhodol Green, Tamra, Rox, and Texas Red.

In yet another embodiment of the invention, the detection reagent may be further labeled with a quencher dye such as Tamra, especially when the reagent is used as a self-quenching probe such as a TaqMan (U.S. Pat. Nos. 5,210,015 and 5,538,848) or Molecular Beacon probe (U.S. Pat. Nos. 5,118,801 and 5,312,728), or other stemless or linear beacon probe (Livak et al., PCR Method Appl. 4:357-362 [1995]; Tyagi et al., Nature Biotechnology 14:303-308 [1996]; Nazarenko et al., Nucl. Acids Res. 25:2516-2521 [1997]; U.S. Pat. Nos. 5,866,336 and 6,117,635).

The detection reagents of the invention may also contain other labels, including but not limited to, biotin for streptavidin binding, hapten for antibody binding, and oligonucleotide for binding to another complementary oligonucleotide such as pairs of zipcodes.

Gene Expression Kits and Systems

A person skilled in the art will recognize that, based on the gene and sequence information disclosed herein, expression profiling reagents can be developed and used to assay any genes of the present invention individually or in combination, and such detection reagents can be readily incorporated into one of the established kit or system formats which are well known in the art. The terms “kits” and “systems,” as used herein in the context of gene expression profiling reagents, are intended to refer to such things as combinations of multiple gene expression profiling reagents, or one or more gene expression profiling reagents in combination with one or more other types of elements or components (e.g., other types of biochemical reagents, containers, packages such as packaging intended for commercial sale, substrates to which gene expression profiling reagents are attached, electronic hardware components, etc.). Accordingly, the present invention further provides gene expression profiling kits and systems, including but not limited to, packaged probe and primer sets (e.g., TaqMan probe/primer sets), arrays/microarrays of nucleic acid molecules, and beads that contain one or more probes, primers, or other detection reagents for profiling one or more genes of the present invention. The kits/systems can optionally include various electronic hardware components; for example, arrays (“DNA chips”) and microfluidic systems (“lab-on-a-chip” systems) provided by various manufacturers typically comprise hardware components. Other kits/systems (e.g., probe/primer sets) may not include electronic hardware components, but may be comprised of, for example, one or more gene expression profiling reagents (along with, optionally, other biochemical reagents) packaged in one or more containers.

In some embodiments, a gene expression profiling kit typically contains one or more detection reagents and other components (e.g., a buffer, enzymes such as reverse transcriptase, DNA polymerases or ligases, reverse transcription and chain extension nucleotides such as deoxynucleotide triphosphates, and in the case of Sanger-type DNA sequencing reactions, chain terminating nucleotides, positive control sequences, negative control sequences, and the like) necessary to carry out an assay or reaction, such as reverse transcription, amplification and/or detection of a gene-containing nucleic acid molecule. A kit may further contain means for determining the amount of a target nucleic acid, and means for comparing the amount with a standard, and can comprise instructions for using the kit to detect the gene-containing nucleic acid molecule of interest. In one embodiment of the present invention, kits are provided which contain the necessary reagents to carry out one or more assays to profile the expression of one or more of the genes disclosed herein. In a preferred embodiment of the present invention, gene expression profiling kits/systems are in the form of nucleic acid arrays, or compartmentalized kits, including microfluidic/lab-on-a-chip systems.

Gene expression profiling kits/systems may contain, for example, one or more probes, or pairs of probes, that hybridize to a nucleic acid molecule at or near each target gene sequence position. Multiple pairs of gene-specific probes may be included in the kit/system to simultaneously assay large numbers of genes, at least one of which is a gene of the present invention. In some kits/systems, the gene-specific probes are immobilized to a substrate such as an array or bead. For example, the same substrate can comprise gene-specific probes for detecting at least 1 or substantially all of the genes shown in Table 2, or any other number in between.

The terms “arrays,” “microarrays,” and “DNA chips” are used herein interchangeably to refer to an array of distinct polynucleotides affixed to a substrate, such as glass, plastic, paper, nylon or other type of membrane, filter, chip, or any other suitable solid support. The polynucleotides can be synthesized directly on the substrate, or synthesized separate from the substrate and then affixed to the substrate. In one embodiment, the microarray is prepared and used according to the methods described in U.S. Pat. No. 5,837,832 (Chee et al.), PCT application WO95/11995 (Chee et al.), Lockhart, D. J. et al. (Nat. Biotech. 14:1675-1680 [1996]) and Schena, M. et al. (Proc. Natl. Acad. Sci. 93:10614-10619 [1996]), all of which are incorporated herein in their entirety by reference. In other embodiments, such arrays are produced by the methods described by Brown et al., U.S. Pat. No. 5,807,522.

Nucleic acid arrays are reviewed in the following references: Zammatteo et al., “New chips for molecular biology and diagnostics,” Biotechnol. Annu. Rev. 8:85-101 (2002); Sosnowski et al., “Active microelectronic array system for DNA hybridization, genotyping and pharmacogenomic applications,” Psychiatr. Genet. 12(4):181-92 (December 2002); Heller, “DNA microarray technology: devices, systems, and applications,” Annu. Rev. Biomed. Eng. 4:129-53 (2002); Epub Mar. 22, 2002; Kolchinsky et al., “Analysis of SNPs and other genomic variations using gel-based chips,” Hum. Mutat. 19(4):343-60 (April 2002); and McGall et al., “High-density genechip oligonucleotide probe arrays,” Adv. Biochem. Eng. Biotechnol. 77:21-42 (2002).

Any number of probes, such as gene-specific probes, may be implemented in an array, and each probe or pair of probes can hybridize to a different gene sequence position. In the case of polynucleotide probes, they can be synthesized at designated areas (or synthesized separately and then affixed to designated areas) on a substrate using a light-directed chemical process. Each DNA chip can contain, for example, thousands to millions of individual synthetic polynucleotide probes arranged in a grid-like pattern and miniaturized (e.g., to the size of a dime). Preferably, probes are attached to a solid support in an ordered, addressable array.

A microarray can be composed of a large number of unique, single-stranded polynucleotides, usually either synthetic antisense polynucleotides or fragments of cDNAs, fixed to a solid support. Typical polynucleotides are preferably about 6-60 nucleotides in length, more preferably about 15-30 nucleotides in length, and most preferably about 18-25 nucleotides in length. For certain types of microarrays or other detection kits/systems, it may be preferable to use oligonucleotides that are only about 7-20 nucleotides in length. In other types of arrays, such as arrays used in conjunction with chemiluminescent detection technology, preferred probe lengths can be, for example, about 15-80 nucleotides in length, preferably about 50-70 nucleotides in length, more preferably about 55-65 nucleotides in length, and most preferably about 60 nucleotides in length. The microarray or detection kit can contain polynucleotides that cover the known 5′ or 3′ sequence of a gene/transcript, sequential polynucleotides that cover the full-length sequence of a gene/transcript, or unique polynucleotides selected from particular areas along the length of a target gene/transcript sequence, particularly areas corresponding to one or more genes disclosed in Table 2. Polynucleotides used in the microarray or detection kit can be specific to a gene or genes of interest (e.g., specific to a particular signature sequence within a target gene sequence, or specific to a particular gene sequence at multiple different sequence sites), or specific to a polymorphic gene/transcript or genes/transcripts of interest.

Hybridization assays based on polynucleotide arrays rely on the differences in hybridization stability of the probes to perfectly matched and mismatched target sequences.

In other embodiments, the arrays are used in conjunction with chemiluminescent detection technology. The following patents and patent applications, which are all herein incorporated by reference in their entirety, provide additional information pertaining to chemiluminescent detection: U.S. patent application Ser. Nos. 10/620,332 and 10/620,333 describe chemiluminescent approaches for microarray detection; U.S. Pat. Nos. 6,124,478, 6,107,024, 5,994,073, 5,981,768, 5,871,938, 5,843,681, 5,800,999, and 5,773,628 describe methods and compositions of dioxetane for performing chemiluminescent detection; and U.S. published application US2002/0110828 discloses methods and compositions for microarray controls.

In one embodiment of the invention, a nucleic acid array can comprise an array of probes of about 15-25 nucleotides in length. In further embodiments, a nucleic acid array can comprise any number of probes, in which at least one probe is capable of detecting one or more genes disclosed in Table 2, and/or at least one probe comprises a fragment of one of the gene sequences selected from the group consisting of those disclosed in Table 2, and sequences complementary thereto, said fragment comprising at least about 8 consecutive nucleotides, preferably 10, 12, 15, 16, 18, 20, more preferably 22, 25, 30, 40, 47, 50, 55, 60, 65, 70, 80, 90, 100, or more consecutive nucleotides (or any other number in-between) and containing (or being complementary to) a sequence of a gene disclosed in Table 2. In some embodiments, the nucleotide complementary to the gene site is within 5, 4, 3, 2, or 1 nucleotide from the center of the probe, more preferably at the center of said probe.

A polynucleotide probe can be synthesized on the surface of the substrate by using a chemical coupling procedure and an ink jet application apparatus, as described in PCT application WO95/251116 (Baldeschweiler et al.) which is incorporated herein in its entirety by reference. In another aspect, a “gridded” array analogous to a dot (or slot) blot may be used to arrange and link cDNA fragments or oligonucleotides to the surface of a substrate using a vacuum system, thermal, UV, mechanical or chemical bonding procedures. An array, such as those described above, may be produced by hand or by using available devices (slot blot or dot blot apparatus), materials (any suitable solid support), and machines (including robotic instruments), and may contain 8, 24, 96, 384, 1536, 6144 or more polynucleotides, or any other number which lends itself to the efficient use of commercially available instrumentation.

Using such arrays or other kits/systems, the present invention provides methods of identifying and profiling expression of the genes disclosed herein in a test sample. Such methods typically involve incubating a test sample of nucleic acids with an array comprising one or more probes corresponding to at least one gene sequence position of the present invention, and assaying for binding of a nucleic acid from the test sample with one or more of the probes. Conditions for incubating a gene expression profiling reagent (or a kit/system that employs one or more such gene expression profiling reagents) with a test sample vary. Incubation conditions depend on factors such as the format employed in the assay, the profiling methods employed, and the type and nature of the profiling reagents used in the assay. One skilled in the art will recognize that any one of the commonly available hybridization, amplification and array assay formats can readily be adapted to detect the genes disclosed herein.

A gene expression profiling kit/system of the present invention may include components that are used to prepare nucleic acids from a test sample for the subsequent reverse transcription, RNA enrichment, amplification and/or detection of a gene sequence-containing nucleic acid molecule. Such sample preparation components can be used to produce nucleic acid extracts (including DNA, cDNA and/or RNA) from any tumor tissue source, including but not limited to, fresh tumor biopsy, frozen or formalin-fixed paraffin-embedded (FFPE) tissue specimens, or tumors collected and preserved by any method. The test samples used in the above-described methods will vary based on such factors as the assay format, nature of the profiling method, and the specific tissues, cells or extracts used as the test sample to be assayed. Methods of preparing nucleic acids are well known in the art and can be readily adapted to obtain a sample that is compatible with the system utilized. Automated sample preparation systems for extracting nucleic acids from a test sample are commercially available, and examples are Qiagen's BioRobot 9600, Applied Biosystems' PRISM 6700, and Roche Molecular Systems' COBAS AmpliPrep System.

Another form of kit contemplated by the present invention is a compartmentalized kit. A compartmentalized kit includes any kit in which reagents are contained in separate containers. Such containers include, for example, small glass containers, plastic containers, strips of plastic, glass or paper, or arraying material such as silica. Such containers allow one to efficiently transfer reagents from one compartment to another compartment such that the test samples and reagents are not cross-contaminated, or from one container to another vessel not included in the kit, and the agents or solutions of each container can be added in a quantitative fashion from one compartment to another or to another vessel. Such containers may include, for example, one or more containers which will accept the test sample, one or more containers which contain at least one probe or other gene expression profiling reagent for profiling the expression of one or more genes of the present invention, one or more containers which contain wash reagents (such as phosphate buffered saline, Tris-buffers, etc.), and one or more containers which contain the reagents used to reveal the presence of the bound probe or other gene expression profiling reagents. The kit can optionally further comprise compartments and/or reagents for, for example, reverse transcription, RNA enrichment, nucleic acid amplification or other enzymatic reactions such as primer extension reactions, hybridization, ligation, electrophoresis (preferably capillary electrophoresis), mass spectrometry, and/or laser-induced fluorescent detection. The kit may also include instructions for using the kit. Exemplary compartmentalized kits include microfluidic devices known in the art (see, e.g., Weigl et al., “Lab-on-a-chip for drug development,” Adv. Drug Deliv. Rev. 24, 55[3]:349-77 [February 2003]). In such microfluidic devices, the containers may be referred to as, for example, microfluidic “compartments,” “chambers,” or “channels.”

Uses of Gene Expression Profiling Reagents

The nucleic acid molecules in Table 3 of the present invention have a variety of uses, especially in the prognosis of breast cancer metastasis. For example, the nucleic acid molecules are useful as amplification primers or hybridization probes, such as for expression profiling using messenger RNA, transcript RNA, cDNA, genomic DNA, amplified DNA or other nucleic acid molecules, and for isolating full-length cDNA and genomic clones encoding the genes disclosed in Table 2 as well as their orthologs.

A probe can hybridize to any nucleotide sequence along the entire length of a nucleic acid molecule. Preferably, a probe of the present invention hybridizes to a region of a target sequence that encompasses a gene sequence of the genes indicated in Table 2. More preferably, a probe hybridizes to a gene-containing target sequence in a sequence-specific manner such that it distinguishes the target sequence from other nucleotide sequences which vary from the target sequence. Such a probe is particularly useful for detecting the presence of a gene-containing nucleic acid in a test sample.

Thus, the nucleic acid molecules of the invention can be used as hybridization probes, reverse transcription and/or amplification primers to detect and profile the expression levels of the genes disclosed herein, thereby determining the probability of whether an individual with breast cancer and said expression profile is at risk for distant metastasis. Expression profiling of the disclosed genes provides a prognostic tool for distant metastasis.

Generation of the Metastasis Score

Expression levels of the fourteen genes disclosed in Table 2 can be used to derive a metastasis score (MS) predictive of metastasis risk. Expression levels may be calculated by the Δ(ΔCt) method, where Ct = the threshold cycle for target amplification; i.e., the cycle number in PCR at which exponential amplification of target begins (K J Livak and T D Schmittgen, 2001, Methods 25:402-408). The level of mRNA of each of the 14 profiled genes may be defined as:

$\Delta(\Delta Ct) = (Ct_{GOI} - Ct_{EC})_{\text{test RNA}} - (Ct_{GOI} - Ct_{EC})_{\text{ref RNA}}$

where GOI = gene of interest (each of 14 signature genes), test RNA = RNA obtained from the patient sample, ref RNA = a calibrator reference RNA, and EC = an endogenous control. The expression level of each signature gene may be first normalized to the three endogenous control genes, listed in Table 2 (EC). A Ct representing the average of the Cts obtained from amplification of the three endogenous controls (Ct_EC) can be used to minimize the risk of normalization bias that would occur if only one control gene were used (T. Suzuki, P J Higgins et al., 2000, Biotechniques 29:332-337). Primers that may preferably be used to amplify the endogenous control genes are listed in Table 3; but primers possible for amplifying these endogenous controls are not limited to these disclosed oligonucleotides. The adjusted expression level of the gene of interest may be further normalized to a calibrator reference RNA pool, ref RNA (universal human reference RNA, Stratagene, La Jolla, Calif.). This can be used to standardize expression results obtained from various machines.
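For illustration, the Δ(ΔCt) calculation above may be sketched as follows. This is a minimal sketch, assuming that the endogenous-control Ct is the arithmetic mean of the three control Cts as described; the function name `delta_delta_ct` and all Ct values in the usage lines are hypothetical and are not data from this document.

```python
from statistics import mean


def delta_delta_ct(ct_goi_test, ct_ec_test, ct_goi_ref, ct_ec_ref):
    """Return Δ(ΔCt) for one gene of interest (GOI).

    ct_goi_test / ct_goi_ref: Ct of the GOI in the test RNA and in the reference RNA.
    ct_ec_test / ct_ec_ref: lists of Cts of the endogenous controls in each RNA;
    they are averaged to give Ct_EC.
    """
    delta_test = ct_goi_test - mean(ct_ec_test)  # (Ct_GOI - Ct_EC) in test RNA
    delta_ref = ct_goi_ref - mean(ct_ec_ref)     # (Ct_GOI - Ct_EC) in ref RNA
    return delta_test - delta_ref


# Hypothetical Ct values, for illustration only.
value = delta_delta_ct(
    ct_goi_test=27.0, ct_ec_test=[20.0, 21.0, 22.0],
    ct_goi_ref=25.0, ct_ec_ref=[19.0, 20.0, 21.0],
)
print(value)  # (27 - 21) - (25 - 20) = 1.0
```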

The Δ(ΔCt) value, obtained in gene expression profiling for each of the 14 signature genes, may be used in the following formula to generate a metastasis score (MS):

${MS} = {{a\; 0} + {\sum\limits_{i = 1}^{M}\; {{ai}*{Gi}}}}$

in which Gi represents the expression level of each gene (i) of the 14-gene prognostic signature. The value of Gi is the Δ(ΔCt) obtained in expression profiling described above. The constant ai for each gene i is provided in Table 2. The constant a0=0.022; this centers the MS so that its median value is zero. M is the number of genes in the component list; in this case fourteen. Thus, the MS is a measure of the summation of expression levels for the 14 genes disclosed in Table 2, each multiplied by a particular constant ai, also in Table 2, and finally this summation is added to the centering constant 0.022 to derive the MS.
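The weighted sum just described may be sketched as follows. This is a minimal sketch only: the function name `metastasis_score` is hypothetical, and the coefficients and Δ(ΔCt) values in the usage lines are placeholders chosen for arithmetic convenience. The actual constants ai are those provided in Table 2, and a real calculation would use all fourteen genes rather than the two shown.

```python
def metastasis_score(ddct, coeffs, a0=0.022):
    """MS = a0 + sum over genes of ai * Gi, with Gi the Δ(ΔCt) of gene i.

    ddct:   dict mapping gene name to its Δ(ΔCt) value (Gi).
    coeffs: dict mapping gene name to its constant ai (see Table 2).
    """
    if set(ddct) != set(coeffs):
        raise ValueError("expression values and coefficients must cover the same genes")
    return a0 + sum(coeffs[gene] * ddct[gene] for gene in coeffs)


# Placeholder values for two genes only; NOT the Table 2 constants.
example_ddct = {"TK1": 1.0, "UBE2S": 2.0}
example_coeffs = {"TK1": 0.5, "UBE2S": -0.25}
example_ms = metastasis_score(example_ddct, example_coeffs)
```

Keying both inputs by gene name, rather than by position, guards against pairing a coefficient with the wrong gene.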

Alternatively, the Δ(ΔCt) value, obtained in gene expression profiling for each of the 14 signature genes, may be used in the following formula to generate a metastasis score (MS):

${MS} = {{a\; 0} + {b*{\sum\limits_{i = 1}^{M}\; {{ai}*{Gi}}}}}$

in which Gi represents the standardized expression level of each gene (i) of the 14-gene prognostic signature. The value of Gi is obtained by subtracting the mean gene expression in the training set from the original expression level, measured as the Δ(ΔCt) obtained in expression profiling described above, and then dividing by the standard deviation of the gene expression in the training set. The constant ai for each gene i is provided in Table 2. The constant b was −0.251; it was obtained from a univariate Cox model with the principal component as a predictor, to give the correct sign and scaling. The constant a0=0.022; this centers the MS so that its median value is zero. M is the number of genes in the component list; in this case fourteen. Thus, the MS is a measure of the summation of expression levels for the 14 genes disclosed in Table 2, each multiplied by a particular constant ai, also in Table 2. This summation is multiplied by a constant b and the centering constant 0.022 is then added to derive the MS.
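The standardized variant may be sketched in the same way. This is a minimal sketch, assuming the training-set mean and standard deviation of each gene are available as inputs; the function name `standardized_ms` and the parameters `train_mean` and `train_sd` are hypothetical names, and no training-set statistics are reproduced here. Only the constants b = −0.251 and a0 = 0.022 are taken from the text above.

```python
def standardized_ms(ddct, coeffs, train_mean, train_sd, b=-0.251, a0=0.022):
    """MS = a0 + b * sum(ai * Gi), with Gi the standardized Δ(ΔCt) of gene i.

    ddct:       dict of gene name -> Δ(ΔCt) measured for the sample.
    coeffs:     dict of gene name -> constant ai (see Table 2).
    train_mean: dict of gene name -> mean Δ(ΔCt) in the training set.
    train_sd:   dict of gene name -> standard deviation in the training set.
    """
    total = 0.0
    for gene, a_i in coeffs.items():
        # Standardize: subtract the training mean, divide by the training SD.
        g_i = (ddct[gene] - train_mean[gene]) / train_sd[gene]
        total += a_i * g_i
    return a0 + b * total
```

Because b is negative, a larger weighted sum of standardized Δ(ΔCt) values gives a lower MS under this formula.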

Any new sample may be evaluated by generating this metastasis score from the 14-gene expression profiling data for that patient, and from this score the probability of distant metastasis for the patient can be determined.
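By way of illustration only, the two formulas above may be sketched in Python as follows. The function names and the three-gene example values are hypothetical placeholders introduced for this sketch; the actual signature uses fourteen genes, and the coefficients ai must be taken from Table 2. Only a0=0.022 and b=−0.251 are taken from the text.

```python
A0 = 0.022   # centering constant stated in the text
B = -0.251   # scaling constant stated for the second formula

def standardize(ddct, train_mean, train_sd):
    """Standardize one gene's Δ(ΔCt) with the training-set mean and SD."""
    return (ddct - train_mean) / train_sd

def metastasis_score(g, a, a0=A0, b=1.0):
    """Return a0 + b * sum(a_i * G_i); b=1.0 reproduces the first formula."""
    if len(g) != len(a):
        raise ValueError("need exactly one coefficient per gene")
    return a0 + b * sum(ai * gi for ai, gi in zip(a, g))

# Hypothetical three-gene illustration (NOT the Table 2 coefficients).
example_a = [0.30, 0.25, -0.20]   # placeholder a_i values
example_g = [1.0, -0.5, 2.0]      # placeholder G_i values
ms_first = metastasis_score(example_g, example_a)         # first formula
ms_second = metastasis_score(example_g, example_a, b=B)   # second formula
```

For the second formula, each Gi would first be passed through `standardize` with the training-set mean and standard deviation of that gene.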

Note that the MS score can be simply a sum of the values of Δ(ΔCt) as described above, in which case the formula of the MS is simplified by setting a0 to zero and each constant ai to one.

Note that the MS score can also be simply a sum of the values of Δ(ΔCt) as described above, then multiplied by the constant −0.04778 for correct sign and scaling such that distant metastasis risk increases with increase of MS. Finally the constant 0.8657 is added so that the mean of MS is zero. An MS score derived in this alternative way will have equal weighting of all fourteen genes. The risk of distant metastasis would increase as MS increases. The two different MS scores described here have very high correlation, with a Pearson correlation coefficient greater than 0.999.
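The equal-weight variant just described can be sketched as below. The function name is a hypothetical choice for this illustration; the two constants are those stated in the preceding paragraph.

```python
SCALE = -0.04778   # sign and scaling constant from the text
OFFSET = 0.8657    # centering constant from the text

def equal_weight_ms(ddct_values):
    """Equal-weight MS: scaled sum of the Δ(ΔCt) values plus the offset."""
    return SCALE * sum(ddct_values) + OFFSET
```

In use, `ddct_values` would hold the fourteen Δ(ΔCt) values of one patient sample.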

Generation of Distant Metastasis Probability from MS

The probability of distant metastasis for any individual patient can be calculated from the MS at variable time points, using the Weibull distribution as the baseline survival function.

The metastasis score (MS) obtained above, from expression profiling of the 14-gene signature, was converted into the probability of distant metastasis by means of the Cox proportional hazard model. Because the Cox model does not specify the baseline hazard function, the hazard and survivor functions were first constructed through parametric regression models. In the parametric regression models, distant metastasis-free survival time was the outcome, and the metastasis score (MS) was the independent variable input. The event time was assumed to have a Weibull distribution; its two parameters were estimated using the survival data from which the MS was derived. To calculate the probability of distant metastasis within a certain time for a patient, the MS value is simply substituted into the formula for the survivor function.
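The substitution described above may be illustrated with the following sketch. It assumes a proportional-hazards Weibull parameterization, S(t | MS) = exp(−(t/scale)^shape · exp(beta·MS)), which is one common form and is not stated in this description. The parameter names `shape`, `scale` and `beta` are hypothetical, and no fitted values are given here, so any numbers passed in are placeholders.

```python
import math

def weibull_survival(t, ms, shape, scale, beta):
    """Assumed survivor function: exp(-(t/scale)**shape * exp(beta*ms))."""
    if t < 0:
        raise ValueError("time must be non-negative")
    return math.exp(-((t / scale) ** shape) * math.exp(beta * ms))

def metastasis_probability(t, ms, shape, scale, beta):
    """Probability of distant metastasis within time t, i.e. 1 - S(t | MS)."""
    return 1.0 - weibull_survival(t, ms, shape, scale, beta)
```

With a positive `beta`, a higher MS yields a higher probability at every time point, which matches the intended direction of the score.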

Clinical Application of the MS Score in Risk Determination

One way of using the MS score in determining the risk for metastasis is to generate one or more MS Thresholds, also known as MS “cut points”. Such an MS Threshold can be used as a benchmark when compared to the MS score of a breast cancer patient so as to determine whether such patient has either an increased or decreased risk. MS Thresholds can be determined by different methods and are different for different definitions of Metastasis Score. For MS defined in Equation 1 that was used in Examples 1, 2 and 3, the MS Threshold was determined from hazard ratios of high-risk vs. low-risk groups. Kaplan-Meier (KM) curves for distant metastasis-free survival are generated for the high- and low-risk patient groups defined by MS cut points. The choice of median MS as cut point is based upon the calculation of the hazard ratios of the high vs. low-risk groups using different cut points from the tenth percentile of MS to the ninetieth percentile of MS. The median cut point can be defined as the point where there are an equal number of individuals in the high and low-risk groups, and is found to produce near the highest hazard ratio in the training samples as described in Example 1. Hazard ratios (HR) and 95% confidence intervals (CI) using the cut point of median MS can be calculated and reported. Log rank tests are performed, and the hazard ratios are calculated for different cut points. The accuracy and value of the 14-gene signature in predicting distant metastasis at five years can be assessed by various means. (X H Zhou, N. Obuchowski et al., eds., 2002, Statistical Methods in Diagnostic Medicine, Wiley-Interscience, New York). For MS defined in Equation 2 that was used in Examples 4 and 5, the MS Threshold was determined from the sensitivity and specificity of MS to predict distant metastasis in 5 years in samples from Guy's Hospital described in Example 2. Two MS cut points are chosen. The first cut point is chosen such that the sensitivity of MS to predict distant metastasis in 5 years is over 90%. The second cut point is chosen such that the sensitivity and specificity of MS to predict distant metastasis in 5 years will both be at 70%. For MS defined in Equation 2, the first MS Threshold is −0.1186 and the second MS Threshold is 0.3019. With two MS cut points, there are high, intermediate and low MS groups. In treated samples from Guy's Hospital and treated samples from Aichi Cancer Center in Japan, the high MS group is designated as the high-risk group and the intermediate and low MS groups are designated as the low-risk group.
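The two-threshold grouping for MS defined in Equation 2 can be sketched as follows. The function names are hypothetical, and the handling of a score that falls exactly on a threshold (assigned here to the higher group) is an assumption of this sketch, since the text does not specify it.

```python
LOW_CUT = -0.1186   # first MS Threshold from the text
HIGH_CUT = 0.3019   # second MS Threshold from the text

def ms_group(ms):
    """Assign an MS value to the low, intermediate or high MS group."""
    if ms < LOW_CUT:
        return "low"
    if ms < HIGH_CUT:        # boundary convention is an assumption
        return "intermediate"
    return "high"

def risk_group(ms):
    """Two-level designation described for treated samples:
    high MS is high-risk; intermediate and low MS are low-risk."""
    return "high-risk" if ms_group(ms) == "high" else "low-risk"
```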

EXAMPLES

The following examples are offered to illustrate, but not to limit the claimed invention.

Example One The mRNA Expression Levels of a 14-Gene Prognostic Signature Predict Risk for Distant Metastasis in 142 Lymph Node-Negative, ER-Positive Breast Cancer Patients

The following example illustrates how a 14-gene prognostic signature was identified and how it can be used in determining prognosis for distant metastasis in breast cancer patients, even in routine clinical laboratory testing. A clinician can perform mRNA expression profiling on the 14 genes described herein, using RNA obtained from a number of means such as biopsy, FFPE, frozen tissues, etc., and then insert the expression data into an algorithm provided herein to determine a prognostic metastasis score.

FFPE tissue sections obtained from node-negative, ER-positive breast cancer patients were used in the example described below. An initial set of 200 genes was analyzed to derive the final 14-gene signature. Included as candidate genes for this signature were genes previously reported in the literature. Also in this example, the extent of overlap of this signature with routinely used prognostic factors and tools was determined.

Tumors from node-negative, ER-positive patients were selected for this study because prognostic information for node-negative patients would be of great value in guiding treatment strategies. Also, microarray studies indicate that this tumor subset is clinically distinct from other types of breast cancer tumors. (T. Sorlie, C M Perou et al., 2001, Proc Natl Acad Sci USA 98:10869-10874; C. Sotiriou, S Y Neo et al., 2003, Proc Natl Acad Sci USA 100:10393-10398). Genes were chosen for expression profiling from the gene signatures reported by H. Dai (H. Dai, L J van't Veer et al., 2005, Cancer Res 15:4059-4066), L J van't Veer (L J van't Veer, H. Dai et al., 2002, Nature 415:530-536), and S P Paik (S P Paik, S. Shak et al., 2004, N Engl J Med 351:2817-2826), in FFPE sections to determine the robustness of these genes and the extent to which routinely collected and stored clinical samples could be used for prognosis of metastasis. From the gene expression data a metastasis score was developed to estimate distant metastasis probability in individual patients for any timeframe.

Patients and Samples

A total of 142 node-negative, ER-positive patients with early stage breast cancer were selected, none of whom had received systemic adjuvant therapy (Training samples in Table 1). By limiting the study to a subset of breast cancer cases, a molecular signature was identified with a more compelling association with metastasis, more robust across different sample sets, and comprising a smaller number of genes so as to better facilitate translation to routine clinical practice. The mean age of the patients was approximately 62 years (ranging from 31 to 89 years).

A highly-characterized breast tumor sample set served as the source of samples for this study; the set was accrued from 1975 to 1986 at the California Pacific Medical Center (CPMC). The inclusion criteria for the primary study included samples from tumors from patients who were lymph-node negative, had received no systemic therapy, and received follow-up care for eight years.

Samples were approved for use in this study by the respective institutional medical ethics committees. Patients providing samples were classified as ER-positive based on a measurement of the expression level of the ESR1 gene. Expression level of the ESR1 gene correlates well with an individual's ER status. (M. Cronin, M. Pho et al. 2004, Am J Pathol 164(1):35-42; J M Knowlden, J M Gee et al., 1997, Clin Cancer Res 3:2165-2172).

Distant metastasis-free survival was chosen as the primary endpoint because it is most directly linked to cancer-related death. A secondary endpoint was overall survival.

Sample Processing

Four 10 μm sections from each paraffin block were used for RNA extraction. The tumor regions were removed based on a guide slide where the cancer cell areas had been marked by a pathologist, and the RNA was extracted using Pinpoint Slide RNA Isolation System II (Zymo Research, Orange, Calif.).

The yields of total RNA varied between samples. In order to increase the amount of RNA available for analysis, a T7 RNA polymerase linear amplification method was performed on the extracted RNA. RNA isolated from FFPE samples was subjected to T7-based RNA amplification using the MessageAmpII aRNA amplification kit (Ambion, Austin, Tex.).

To assess the consistency of gene expression before and after RNA amplification, a number of experiments were conducted on various genes in different samples. Amplification was first performed on RNA from 67 FFPE samples that were not a part of this study, using 0.1-100 ng of total RNA. Profiling of 20 genes was performed using the resultant enriched RNA and the original, unenriched RNA. These comparisons revealed that the fold enrichment varied from gene to gene; however, the relative expression level was consistent before and after RNA amplification in all 20 genes for 67 samples.

RNA for this study was enriched by amplification with the MessageAmpII aRNA amplification kit, as described above. Total RNA was quantified using spectrophotometric measurements (OD₂₆₀).

Gene Expression Profiling

Based on a survey of the published literature and results of microarray-based gene expression profiling experiments, 200 candidate genes were initially selected for analysis in order to determine the optimal prognostic signature. This set included genes from the 70-gene prognosis panel described by van't Veer et al. (L J van't Veer, H. Dai et al., 2002, Nature 415:530-536), 104 genes analyzed by Dai et al. (H. Dai, L J van't Veer et al., 2005, Cancer Res 15:4059-4066), the 16-gene panel comprising the signature for response to Tamoxifen treatment reported by Paik et al. (S P Paik, S. Shak et al., 2004, N Engl J Med 351:2817-2826), and 24 ER-related genes as reported by West et al. (M. West, C. Blanchette et al., 2001, Proc Natl Acad Sci USA 98:11462-11467).

Additional genes were selected as endogenous controls (EC) for normalizing expression data, according to the method described in J. Vandesompele, K. De Preter et al., Genome Biol 3(7): Research0034.1-0034.11 (Epub 2002). Endogenous controls are also called “housekeeping genes” herein. Six endogenous control genes were tested for the stability of their expression levels in 150 samples of frozen breast cancer tumors. Expression data were analyzed using the geNorm program of Vandesompele et al., in which an M value was determined as a measurement of the stability of a gene's expression level. (J. Vandesompele, K. De Preter et al., Genome Biol 3(7): Research0034.1-0034.11, Epub 2002). The lower the M value, the more stable the gene. Results are shown in Table 7. The M values indicated that PPIG, SLU7 and NUP214 were the most stable endogenous control genes in this sample set, with the least variation in gene expression across samples tested. The stability of these three genes was validated on 138 breast cancer tumor FFPE samples. The results are shown in Table 8.
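For orientation, a stability measure of the kind described above may be sketched as follows. This is not the geNorm program itself. It follows the published definition of M as the average, over all other candidate controls, of the standard deviation of log2 expression ratios across samples, and it assumes that a difference of Ct values approximates a log2 ratio (i.e., roughly 100% PCR efficiency). The function name and input layout are hypothetical.

```python
from statistics import stdev

def stability_m_values(ct):
    """ct maps gene name -> list of Ct values, one per sample, with the
    same sample order for every gene. Returns gene name -> M value."""
    genes = list(ct)
    m_values = {}
    for j in genes:
        pairwise_sd = []
        for k in genes:
            if k == j:
                continue
            # Ct difference stands in for a log2 expression ratio (assumption).
            log_ratios = [cj - ck for cj, ck in zip(ct[j], ct[k])]
            pairwise_sd.append(stdev(log_ratios))
        m_values[j] = sum(pairwise_sd) / len(pairwise_sd)
    return m_values
```

A candidate whose Ct values track those of the other controls from sample to sample receives a low M, which corresponds to the statement above that a lower M value indicates a more stable gene.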

The expression levels of the selected 200 genes, together with the three EC genes, were profiled in 142 RNA samples. For gene expression profiling, relative quantification by means of one-step reverse-transcription polymerase chain reaction (RT-PCR) was performed. Quantification was “relative” in that the expression of the target gene was evaluated relative to the expression of a set of reference, stably expressed control genes. SYBR® Green intercalating dye (Stratagene, La Jolla, Calif.) was used to visualize amplification product during real-time PCR. Briefly, the reaction mix allowed for reverse transcription of extracted sample RNA into cDNA. This cDNA was then PCR amplified in the same reaction tube, according to the cycling parameters described below. PCR conditions were designed so as to allow the primers disclosed in Table 3, upper and lower, to hybridize 5′ and 3′, respectively, of target sequences of the genes of interest, followed by extension from these primers to create amplification product in repetitive cycles of hybridization and extension. PCR was conducted in the presence of SYBR® Green, a dye which intercalates into double-stranded DNA, to allow for visualization of amplification product. RT-PCR was conducted on the Applied Biosystems Prism® 7900HT Sequence Detection System (Applied Biosystems, Foster City, Calif.), which detected the amount of amplification product present at periodic cycles throughout PCR, using amount of intercalated SYBR® Green as an indirect measure of product. (The fluorescent intensity of SYBR® Green is enhanced over 100-fold in binding to DNA.) PCR primers were designed so as to amplify all known splice-variants of each gene, and so that the size of all PCR products would be shorter than 150 base pairs in length, to accommodate the degraded, relatively shorter-length RNA expected to be found in FFPE samples. Primers used in the amplification of the 14 genes in the molecular signature described herein and three endogenous control genes are listed in Table 3. RT-PCR amplifications were performed in duplicate, in 384-well amplification plates. Each well contained a 15 μl reaction mix. The cycle profile consisted of: two minutes at 50° C., one minute at 95° C., 30 minutes at 60° C., followed by 45 cycles of 15 seconds at 95° C. and 30 seconds at 60° C., and ending with an amplification product dissociation analysis. The PCR components were essentially as described in L. Rogge, E. Bianchi et al., 2000, Nat Genet 25:96-101.

The relative changes in gene expression were determined by quantitative PCR. Expression levels were calculated by the Δ(ΔCt) method, where Ct=the threshold cycle for target amplification; i.e., the cycle number in PCR at which time exponential amplification of target begins. (K J Livak and T D Schmittgen, 2001, Methods 25:402-408). The relative level of mRNA of a gene of interest was defined as:

Δ(ΔCt) = (Ct_(GOI) − Ct_(EC))_(test RNA) − (Ct_(GOI) − Ct_(EC))_(ref RNA)

where GOI=gene of interest, test RNA=sample RNA, ref RNA=calibrator reference RNA, and EC=endogenous control. The expression level of every gene of interest was first normalized to the three endogenous control genes. A Ct representing the average of the three endogenous controls (Ct_(EC)) was used to minimize the risk of normalization bias that would occur if only one control gene was used. (T. Suzuki, P J Higgins et al., 2000, Biotechniques 29:332-337). Primers used to amplify the endogenous controls are listed in Table 3. The adjusted expression level of the gene of interest was further normalized to a calibrator reference RNA pool, ref RNA (universal human reference RNA, Stratagene, La Jolla, Calif.). This was used in order to standardize expression results obtained from various machines. The Δ(ΔCt) values obtained in expression profiling experiments of 200 genes were used in the statistical analysis described below to determine the 14-gene prognostic signature of this invention.
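The two-step normalization defined above can be sketched as follows. The function names and the Ct values in the example are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from statistics import mean

def delta_ct(ct_goi, ct_controls):
    """Ct of the gene of interest minus the average Ct of the
    endogenous controls measured in the same RNA."""
    return ct_goi - mean(ct_controls)

def delta_delta_ct(test_goi, test_controls, ref_goi, ref_controls):
    """Δ(ΔCt): the ΔCt of the test RNA minus the ΔCt of the
    calibrator reference RNA."""
    return delta_ct(test_goi, test_controls) - delta_ct(ref_goi, ref_controls)

# Hypothetical Ct values for one gene and three endogenous controls.
example = delta_delta_ct(
    test_goi=28.0, test_controls=[24.0, 25.0, 26.0],   # ΔCt = 28 - 25 = 3
    ref_goi=27.0, ref_controls=[25.0, 25.0, 25.0],     # ΔCt = 27 - 25 = 2
)
# example is 3 - 2 = 1.0
```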

Determination of the 14-Gene Signature

Using data from expression profiling of the original 200 genes (i.e., the Δ(ΔCt) values obtained above), a semi-supervised principal component (SPC) method of determining survival time to distant metastasis was used to develop a list of genes that would comprise a prognostic signature. (E. Bair and R. Tibshirani, 2004, PloS Biology 2:0511-0522). SPC computation was performed using the PAM application, available online via the lab of R. Tibshirani at Stanford University, Stanford, Calif., according to the method of R. Tibshirani, T J Hastie et al. 2002, PNAS 99:6567-6572.

Genes were first ranked according to their association with distant metastasis, using the univariate Cox proportional hazards model. Those genes with a P value<0.05 were considered significant. For any cutoff in the Cox score, SPC computed the component of genes (i.e., the principal component) that reached the optimal threshold: SPC used internal cross-validation in conjunction with a Cox model (with the principal component as the single variable) to select the optimal threshold. The first principal component gene list obtained by SPC was significant, and was used as a predictor in a univariate Cox model, in order to determine the correct sign and scaling.

The principal component gene list as produced by SPC was further reduced by the Lasso regression method. (R. Tibshirani, 1996, J Royal Statistical Soc B, 58:267-288). The Lasso regression was performed using the LARS algorithm. (B. Efron, T. Hastie et al., 2004, Annals of Statistics 32:407-499; T. Hastie, R. Tibshirani et al., eds., 2002, The Elements of Statistical Learning, Springer, New York). The outcome variable used in the LARS algorithm was the principal component as selected by SPC. The Lasso method selected a subset of genes that could reproduce this score with a pre-specified accuracy.

Metastasis Score

The metastasis score (MS) has the form:

${MS} = {{a\; 0} + {\underset{i = 1}{\overset{M}{b*\sum}}\; {{ai}*{Gi}}}}$

Gi represents the standardized expression level of each Lasso-derived gene (i) of the 14-gene prognostic signature. The value of Gi is calculated by subtracting the mean expression of that gene in the whole population from the Δ(ΔCt) obtained in expression profiling described above and then dividing by the standard deviation of that gene. The constants ai are the loadings on the first principal component of the fourteen genes listed in Table 2. The ai score for each gene i is provided in Table 2. The constant b is −0.251; it was obtained from a univariate Cox model with the principal component as a predictor, to get the correct sign and scaling. The constant a0=0.022; this centers the MS so that its median value is zero. M is the number of genes in the component list; in this case fourteen. Thus, the MS is a measure of the summation of expression levels for the 14 genes disclosed in Table 2, each multiplied by a particular constant ai; the summation was then multiplied by the constant b and finally, this product was added to the centering constant 0.022.

The score is herein referred to as MS (all), as it was based on an analysis of all 142 ER-positive individuals studied. Any new sample may be evaluated by generating this metastasis score from the 14-gene expression profiling data for that patient, and optionally from this score, the probability of distant metastasis for the patient can be determined.

Generation of Distant Metastasis Probability from MS

The probability of distant metastasis for any individual patient can be calculated from the MS at variable time points, using the Weibull distribution as the baseline survival function.

The metastasis score (MS) obtained above, from expression profiling of the 14-gene signature, was converted into the probability of distant metastasis by means of the Cox proportional hazard model. Because the Cox model does not specify the baseline hazard function, the hazard and survivor functions were first constructed through parametric regression models. In the parametric regression models, distant metastasis-free survival time was the outcome, and the metastasis score (MS) was the independent variable input. The event time was assumed to have a Weibull distribution; its two parameters were estimated using the survival data from which the MS was derived. To calculate the probability of distant metastasis within a certain time for a patient, the MS value is simply substituted into the formula for the survivor function.

Pre-Validation

In this study, the 142 study patients were randomly divided into ten subsets. One subset was set aside and the entire SPC procedure was performed on the union of the remaining nine subsets. Genes were selected and the prognosticator built upon the nine subsets was applied to obtain the cross-validated metastasis score, MS (CV), for the remaining subset. This cross-validation procedure was carried out 10 times until MS (CV) was filled in for all patients. By building up MS (CV) in this way, each one-tenth piece did not directly use its corresponding survival times, and hence can be considered unsupervised.
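The pre-validation loop described above may be sketched generically as follows. The function `prevalidate` and its `fit` and `predict` arguments are hypothetical stand-ins introduced for this illustration: `fit` represents the entire gene selection and model building procedure (SPC followed by Lasso in this example) and `predict` represents scoring one held-out patient. Neither is implemented here.

```python
import random

def prevalidate(samples, fit, predict, k=10, seed=0):
    """Return one cross-validated score per sample. Each score comes from
    a model built without the fold that contains that sample."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)           # random division
    folds = [indices[i::k] for i in range(k)]      # k near-equal subsets
    scores = [None] * len(samples)
    for held_out in folds:
        held = set(held_out)
        training = [samples[i] for i in indices if i not in held]
        model = fit(training)                      # built on the other folds
        for i in held_out:
            scores[i] = predict(model, samples[i])
    return scores
```

The essential point of the design is that `fit` is called inside the loop, so that gene selection is repeated for every fold rather than performed once on all patients.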

This resulted in a derived variable for all the individuals in the sample, which could then be tested for its performance and compared with other clinical variables. MS (all), however, was built upon all 142 individuals tested, and would produce considerable bias if tested in the same way.

MS (CV) was used to evaluate the accuracy of the 14-gene prognostic signature when time-dependent area under ROC curve (AUC) was calculated (described below). MS (CV) was also used in the Cox regression models when the 14-gene signature was combined with clinical predictors. MS (CV) should have one degree of freedom, in contrast to the usual (non-pre-validated) predictor. The non-pre-validated predictor has many more degrees of freedom.

Statistical Analyses of the MS

Kaplan-Meier (KM) curves for distant metastasis-free and overall survival were generated for the high- and low-risk patient groups using the median of MS (CV) as the cut point (i.e., the 50th percentile of MS (CV)). The choice of median MS as cut point was based upon the calculation of the hazard ratios of the high vs. low-risk groups using different cut points from the tenth percentile of MS to the ninetieth percentile of MS. A balanced number of high-risk and low-risk individuals, as well as a hazard ratio near the highest observed, were the determining factors for choosing the median as the cut point.

Hazard ratios (HR) and 95% confidence intervals (CI) using the cut point of median MS were calculated and reported. Log rank tests were performed, and the hazard ratios were calculated for different cut points. The accuracy and value of the 14-gene signature in predicting distant metastasis at five years were assessed by various means. (X H Zhou, N. Obuchowski et al., eds., 2002, Statistical Methods in Diagnostic Medicine, Wiley-Interscience, New York).

Univariate and multivariate Cox proportional hazards regressions were performed using age, tumor size, tumor grade and the 14-gene signature. Clinical subgroup analyses on the signature were also performed. Statistical analyses were performed using SAS® 9.1 statistical software (SAS Institute, Inc., Cary, N.C.), except for the statistical packages noted herein.

Multi-Gene Signature

Of the 200 candidate genes studied, 44 had unadjusted P values<0.05 in a univariate Cox proportional hazards regression. Patients with poor metastasis prognosis showed an up-regulation of 37 genes, while seven genes were down-regulated. The semi-supervised principal component procedure (SPC) in PAM yielded a prognosticator of 38 genes. The gene list was further reduced to 14 (Table 2) by using the Lasso regression, via the LARS algorithm. Table 2 provides a description of each gene. Hazard ratios (HR) at various cut points (i.e., percentile MS (CV)) were calculated. The median of MS (CV) was chosen to classify patients into low- and high-risk groups.

Results

Distant-Metastasis-Free and Overall Survival Rates in Low-Risk and High-Risk Groups

There were 7 and 24 distant metastases in the 71 low-risk and 71 high-risk patients, respectively, as defined by the median of the cross-validated Metastasis Score, MS (CV), in the training set. The Kaplan-Meier estimate (FIG. 1a) indicated significant differences in distant metastasis-free survival (DMFS) between the two groups, with a log-rank p-value of 0.00028. The 5-year and 10-year DMFS rates (standard error) in the low-risk group were 0.96 (0.025) and 0.90 (0.037), respectively. For the high-risk group, the corresponding rates were 0.74 (0.053) and 0.62 (0.066). For overall survival (FIG. 1b), there was also a significant difference between the two groups (log-rank p-value=0.0048). The 5-year and 10-year OS rates (standard error) in the low-risk group were 0.90 (0.036) and 0.78 (0.059), while the corresponding rates were 0.79 (0.049) and 0.48 (0.070) in the high-risk group (Table 5).

Hazard Ratios from Univariate and Multivariate Cox Regression Models

The unadjusted hazard ratio of the high-risk vs. low-risk groups by MS to predict DMFS was 4.23 (95% CI=1.82 to 9.85) (Table 6) as indicated by the univariate Cox regression analysis. In comparison, the high-risk vs. low-risk groups by tumor grade (medium+high grade vs. low grade) had an unadjusted hazard ratio of 2.18 (1.04-4.59). While tumor size was significant in predicting DMFS (p=0.05) with a 7% increase in hazard per cm increase in diameter, age was not a significant factor in this patient set. In the multivariate Cox regression analyses, the 14-gene molecular signature risk group had a hazard ratio of 3.26 (1.26-8.38), adjusted by age at surgery, tumor size and grade. It was the only significant risk factor (p=0.014) in the multivariate analyses.

Diagnostic Accuracy and Predictive Values

Diagnostic accuracy and predictive values of the 14-gene signature risk groups to predict distant metastases within 5 years are summarized in Table 9. Sensitivity was 0.86 for those who had distant metastases within 5 years while specificity was 0.57 for those who did not. Negative predictive value (NPV) was 96%, indicating that only 4% of individuals whom the gene signature placed in the low-risk group would have distant metastasis within 5 years. Nevertheless, positive predictive value (PPV) was only 26%, indicating that only 26% of individuals whom the molecular signature placed in the high-risk group would develop distant metastasis within 5 years. The high NPV and low PPV were partly attributed to the low prevalence of distant metastasis in 5 years, which was estimated to be 0.15 in the current patient set.
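The dependence of the predictive values on prevalence can be checked with Bayes' rule, as in the following sketch. The function name is hypothetical; the inputs are the rounded sensitivity, specificity and prevalence reported above, so the results agree with the reported NPV and PPV only to rounding.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from sensitivity, specificity and prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

ppv, npv = predictive_values(0.86, 0.57, 0.15)
# round(ppv, 2) is 0.26 and round(npv, 2) is 0.96
```

Because prevalence enters both ratios, the same sensitivity and specificity applied to a population with a higher 5-year metastasis rate would give a higher PPV and a lower NPV.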

Moreover, the receiver operating characteristic (ROC) curve of the continuous MS (CV) to predict distant metastases in 5 years is shown in FIG. 2. AUC was 0.76 (0.65-0.87) for predicting distant metastases in 5 years. AUC for predicting death in 10 years was 0.61 (0.49-0.73). One-sided tests for AUC to be greater than 0.5 were significant, with corresponding p-values of <0.0001 and 0.04 for the metastasis and death endpoints, respectively.

DISCUSSION

Pathway analyses revealed that the fourteen genes in the prognostic signature disclosed herein are involved in a variety of biological functions, but a majority of the genes are involved with cell proliferation. Eleven of the fourteen genes are associated with the TP53 and TNF signaling pathways that have been found to be coordinately over-expressed in tumors leading to poor outcome. BUB1, CCNB1, MYBL2, PKMYT1, PRR11 and ORC6L are cell cycle-associated genes. DIAPH is a gene involved in actin cytoskeleton organization and biogenesis. DC13 is expected to be involved with the assembly of cytochrome oxidase.

Whereas previously reported studies were limited to the use of frozen tissues as the source of RNA and were profiled on microarrays, the invention described in this example demonstrates that real-time RT-PCR may be used for gene profiling in FFPE tumor samples. Thus, it provides for the use of archived breast cancer tissue sections from patients who have extended-outcomes data that predate the routine use of adjuvant therapy.

Distant metastasis-free survival is the prognostic endpoint for the study described in this example. A semi-supervised principal component (SPC) method was used to build the 14-gene signature panel of the invention. The approach used in assembling the signature allowed the derivation of a metastasis score (MS) that can translate an individual's expression profile into a measure of risk of distant metastasis, for any given time period. The ability to quantify risk of metastasis for any timeframe provides highly flexible prognosis information for patients and clinicians in making treatment decisions, because the risk tolerance and time horizon vary among patients.

With the 14-gene molecular signature, high and low-risk groups had significant differences in distant metastasis-free and overall survival rates. This signature includes proliferation genes not routinely tested in breast cancer prognostics. The 14-gene signature has a ten gene overlap with the 50-gene signature described by H. Dai, L J van't Veer et al. (2005, in Cancer Res 15:4059-4066). In contrast, only six genes overlap with the 70-gene signature described by van't Veer, Dai et al. (2002, Nature 415:530-536). This may be explained by the fact that the latter study analyzed a more heterogeneous group of patients, which included both ER-positive and negative patients. The signature described herein had two proliferation gene overlaps with the 16-gene signature described by S P Paik, S. Shak et al., (2004, N Engl J Med 351:2817-2826).

The molecular signature described herein has independent prognostic value over traditional risk factors such as age, tumor size and grade, as indicated from multivariate analyses. This signature provides an even more compelling measure of prognosis when the tumor grade is low. As reported by Dai et al., a subset of this patient group with low grade tumors may be at even higher risk of metastasis than previously estimated. (Dai et al., 2005, Cancer Res 15:4059-4066). The signature described herein also extends the confidence in the prognostic genes initially reported by van't Veer et al. (2002, Nature 415:530-536) and Dai et al. (2005, Cancer Res 15:4059-4066), who primarily used samples from women less than 55 years of age, because this signature was validated on patients with a broad age distribution (median 64 years old), which is similar to the general range of breast cancer patients.

The use of FFPE tissues to sample even smaller amounts of sectioned tumor than microarray studies using frozen tissue corroborated a subset of the genes on a different detection platform (quantitative PCR versus microarrays). This reiteration of results is consistent with the concept described by Bernards and Weinberg, that metastatic potential is an inherent characteristic of most of, rather than a small fraction of, the cells in a tumor. (R. Bernards and R A Weinberg, 2002, Nature 418:823).

The invention described herein also provides an objective estimate of prognosticator performance by using the pre-validation technique proposed by Tibshirani and Efron. (R. Tibshirani, B. Efron, 2002, Statistical Applications in Genetics and Molecular Biology 1: article 1). Several investigators have noted the importance of independent validation in increasingly large and characterized datasets. (R. Simon, J Clin Oncol, 2005, 23:7332-7341; D F Ransohoff, 2004, Nat Rev Cancer 4:309-14; D F Hayes, B. Trock et al., 1998, Breast Cancer Res 52:305-319; D G Altman and P. Royston, 2000, Stat Med 19:453-73).

In the present invention, a unique 14-gene prognostic signature is described that provides information distinct from conventional markers and tools and is not confounded with systemic treatment. While the signature was developed using FFPE sections and RT-PCR for early stage, node-negative, ER-positive patients, it may be used in conjunction with any method known in the art to measure mRNA expression of the genes in the signature and mRNA obtained from any tumor tissue source, including but not limited to, FFPE sections, frozen tumor tissues and fresh tumor biopsies. Based on the mRNA expression levels of the 14-gene signature of the invention, a metastasis score can be calculated for quantifying distant metastasis risk for any individual breast cancer patient. Thus, the invention disclosed herein is amenable for use in routine clinical laboratory testing of ER-positive breast cancer patients for any timeframe.

Example Two

The 14-Gene Signature Predicts Distant Metastasis in Untreated Node-Negative, ER-Positive Breast Cancer Patients Using 280 FFPE Samples

Efforts were undertaken to validate the 14-gene expression signature that can predict distant metastasis in node-negative (N−), estrogen receptor positive (ER+) breast cancer patients in an independent sample set who had not received systemic treatment. Reference is made to the experimental protocols and statistical analyses in Example 1, which were used to assay the effectiveness of the 14-gene signature.

Patients & Methods

A retrospective search of the Breast Tissue and Data Bank at Guy's Hospital was made to identify a cohort of patients diagnosed with primary breast cancer and who had definitive local therapy (breast conservation therapy or mastectomy) but no additional adjuvant systemic treatment. The study group was restricted to women diagnosed between 1975 and 2001, with a clinical tumor size of 3 cm or less, pathologically uninvolved axillary lymph nodes, ER-positive tumor and with more than 5 years follow-up. A total of 412 patients were identified who also had sufficient formalin fixed, paraffin embedded (FFPE) tissues available for RNA extraction. The use of patient material and data for this study has been approved by Guy's Research Ethics Committee.

From this group there was sufficient quantity and quality of mRNA to profile tumors from 300 patients. Subsequently a further 20 cases were excluded from the study. Six patients had bilateral breast cancer prior to distant metastasis, 6 had a missing value in gene expression levels and 8 tumors proved to be ER negative upon re-assessment using current techniques. Thus, in total 280 patients were included in the analyses (validation set from Guy's Hospital in Table 1). To assess selection bias, we compared the 280 patients we analyzed with the 412 patients who were identified to satisfy the inclusion criteria and had sufficient FFPE tissues for RNA extraction. No selection bias was detected as there were no significant differences in age, tumor size and histologic grade between the two sets.

ER status in this group of patients had been re-evaluated using a contemporary IHC assay. Tumors with an Allred score of 3 or more were considered receptor positive. Tumors were classified according to WHO guidelines (World Health Organization, Geneva, Switzerland. Histological typing of breast tumours. Tumori 1982; 68:181), and histological grade was established using the modified Bloom and Richardson method (Elston C W & Ellis I O. Pathological prognostic factors in breast cancer. I. The value of histological grade in breast cancer: experience from a large study with long term follow-up. Histopathology 1991; 19:403-10).

Compared with the training set, patients in the validation set were younger, with larger tumors and a larger proportion of high grade tumors (Table 1). Tests for differences in those characteristics were highly significant (p<0.001).

Metastasis Score

MS in Equation 1, derived from the training samples in Example 1, was applied to the untreated patients from Guy's Hospital. In Example 1, RNA had been enriched, whereas in the case of untreated patients from Guy's Hospital, the RNA samples were not enriched.

To apply MS to the un-enriched samples, conversion factors between enriched and un-enriched samples were obtained from 93 training samples in Example 1 for each of the genes in the molecular signature.

Results

Distant Metastasis Free and Overall Survival Rates in Low-Risk and High-Risk Groups

There were 4 (5.6%) distant metastases in the 71 MS low-risk and 62 (29.7%) distant metastases in the 209 MS high-risk patients, respectively. Kaplan Meier estimate (FIG. 3a) indicated significant differences in DMFS between the two groups with a log-rank p-value of 6.02e-5. The 5-year and 10-year DMFS rates (standard error) in the low-risk group were 0.99 (0.014) and 0.96 (0.025) respectively. For the high-risk group, the corresponding survival rates were 0.86 (0.025) and 0.76 (0.031) (Table 12).
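The DMFS rates above are Kaplan Meier (product-limit) estimates. The following is a minimal sketch of that calculation in Python, assuming hypothetical toy follow-up data; the function name `kaplan_meier` and the example values are illustrative only and are not taken from the study, whose analyses were performed in R and SAS.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times  -- follow-up time for each patient (e.g. years from surgery)
    events -- 1 if a distant metastasis was observed at that time, 0 if censored
    Returns a list of (time, survival probability) at each distinct event time.
    """
    event_counts = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)      # every patient leaves the risk set at their own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = event_counts.get(t, 0)
        if d:
            # Patients censored at time t are still counted as at risk at t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


# Hypothetical toy data, not from the study
toy_times = [1, 2, 2, 3, 4, 5]
toy_events = [1, 0, 1, 0, 1, 0]
toy_curve = kaplan_meier(toy_times, toy_events)
for t, s in toy_curve:
    print(f"t={t}: DMFS estimate {s:.3f}")
```

Applying the same function separately to the low-risk and high-risk groups would give the two curves that a log-rank test then compares.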

For the Adjuvant! risk groups, Kaplan Meier curves of DMFS were shown in FIG. 3c. The 5-year and 10-year DMFS rates (standard error) were 0.96 (0.023) and 0.93 (0.03) for the low-risk group and 0.87 (0.03) and 0.77 (0.031) for the high-risk group, respectively. There were larger differences in survival rates between the high-risk and low-risk groups defined by MS than those defined by Adjuvant!.

For overall survival (OS), Kaplan Meier curves (FIG. 3b) indicated significant difference in OS rates between MS low-risk and high-risk groups (log-rank p-value=0.00028). The 5-year and 10-year OS rates were 0.97 (0.020) and 0.94 (0.028) for the low-risk group and 0.92 (0.019) and 0.71 (0.032) for the high-risk group, respectively. FIG. 3d showed the Kaplan Meier curves for Adjuvant! to predict 5-year and 10-year overall survival. MS and Adjuvant! provide similar prognostic information for overall survival.

Hazard Ratios from Univariate and Multivariate Cox Regression Models

The unadjusted hazard ratio of the high-risk vs. low-risk groups by MS to predict time to distant metastases was 6.12 (95% CI=2.23 to 16.83) (Table 10). The unadjusted hazard ratio for MS risk groups is higher than those for groups defined by Adjuvant!, age, tumor size and histologic grade. Adjuvant! had the second highest hazard ratio of 2.63 (95% CI 1.30-5.32). Risk groups by histologic grade and tumor size were significant in predicting DMFS, but not the age group.

Age group is the most significant prognostic factor in predicting OS with an unadjusted hazard ratio of 2.9 (95% CI 2.03-4.18) (Table 10). Nevertheless, MS risk group can predict overall survival with HR of 2.49.

In the multivariate Cox regression (Table 11a) of time to distant metastases with the MS risk group and clinicopathological risk factors of age, tumor size and histological grade, the hazard ratio of the MS risk group, adjusted by age, tumor size and histologic grade, is 4.81 (1.71-13.53, p=0.003). MS risk group was the only significant risk factor in the multivariate analysis. Therefore, the gene signature has independent prognostic value for DMFS over the traditional clinicopathological risk factors and captures part of the information within these factors.

In the multivariate Cox regression of time to distant metastases with the risk groups by the 14-gene signature and by Adjuvant! (Table 11b), the corresponding adjusted hazard ratios were 5.32 (1.92-14.73) and 2.06 (1.02-4.19). Both MS and Adjuvant! risk groups remained significant to predict DMFS. This indicates that MS and Adjuvant! carried largely independent and complementary prognostic information.

Performance of the Molecular Signature in Different Clinical Subgroups

Table 13 shows that the gene signature predicts distant metastasis in young and old, pre-menopausal and post-menopausal women. While highly prognostic in patients with small size tumors (HR=14.16, p=0.009), it is not significant in patients with tumors larger than 2 cm. While hazard ratio in the low-grade subgroup (HR=7.6) is higher than that in the high-grade (HR=4.6), it only shows a trend to significance (p=0.06) in the low-grade subgroup because of small sample size (7 events in 60 samples).

Hazard ratios in various subgroups indicated that the gene signature is more prognostic in low grade, small size tumors, young and pre-menopausal patients in the validation sample set. Formal tests for interaction between the MS risk group and the clinical variables were not significant. However, the signature was also more prognostic in the low-grade tumors in the CPMC training set. Nevertheless, interaction analyses should be regarded as exploratory as multiple tests were performed.

Diagnostic Accuracy and Predictive Values

The diagnostic accuracy and predictive values of the risk groups by MS and Adjuvant! to predict distant metastases in 10 years were shown in Table 14. The MS risk group has higher sensitivity of 0.94 (0.84-0.98) than the Adjuvant! risk group's 0.90 (0.78-0.96) while the specificity is similar (0.3 (0.24-0.37) for MS vs. 0.31 (0.26-0.38) for Adjuvant!). Using 0.18 as the estimated prevalence of distant metastases in 10 years, PPV and NPV for MS risk group were 0.23 (0.21-0.25) and 0.97 (0.88-0.99) respectively. The corresponding values were 0.23 (0.20-0.25) and 0.93 (0.85-0.97) for the Adjuvant! risk group. Therefore, MS can slightly better predict those who would not have distant metastases within 10 years than Adjuvant! while the predictive values for those who would have distant metastases within 10 years were similar for the molecular and clinical prognosticators.

ROC curves (FIG. 4) of continuous MS to predict distant metastasis within 5 years and 10 years had AUC (95% CI) of 0.73 (0.65-0.81) and 0.70 (0.63-0.78), respectively. A ROC curve to predict death in 10 years had an AUC of 0.68 (0.61-0.75). Hence, MS is predictive of both distant metastases and deaths.

In comparison, AUCs of ROC curves to predict distant metastases within 5 years, 10 years and death in 10 years by Adjuvant! were 0.63 (0.53-0.72), 0.65 (0.57-0.73) and 0.63 (0.56-0.71), respectively, and they were lower than the corresponding values by MS.

MS as a Continuous Predictor of Probability of Distant Metastasis

FIG. 5 shows the probabilities of distant metastasis at 5 and 10 years for an individual patient with a metastasis score, MS. Five-year and ten-year distant metastasis probabilities have median (min-max) of 8.2% (1.4%-31.2%) and 15.2% (2.7%-50.9%) respectively. At zero MS, the cut point to define the risk groups, the 5-year and 10-year distant metastasis probabilities were 5% and 10%, respectively.

The probability of distant metastasis in 10 years by MS was compared with the probability of relapse in 10 years by Adjuvant! (FIG. 6). The coefficient of determination (R²) was 0.15, indicating that only a small portion of variability in probability of distant metastasis by MS can be explained by Adjuvant!. The probability of distant metastasis by MS was lower than the relapse probability by Adjuvant! because all recurrence events were included in the Adjuvant! relapse probability, while only distant metastases were counted as an event in the MS estimate of probability of distant metastasis.

DISCUSSION

Initially a 14-gene prognostic signature was developed based upon mRNA expression from FFPE sections using quantitative RT-PCR for distant metastasis in a node-negative, ER-positive, early-stage, untreated breast cancer training set. The resulting signature was used to generate a metastasis score (MS) that quantifies risk for individuals at different timeframes and was used to dichotomize the sample set into high and low risk. Following initial internal validation of the training set using a recent “pre-validation” statistical technique, we validated the expression signature using the precise dichotomized cutoff of the training set in a similar and independent validation cohort. Performance characteristics of the signature in training and validation sets were similar. Univariate and multivariate hazard ratios were 6.12 and 4.81 for the validation set and 4.23 and 3.26 in the training set to predict DMFS, respectively. In multivariate analysis, only the metastasis score remained significant with a trend to significance for only tumor size of the other clinicopathological factors. The 14-gene prognostic signature can also predict overall survival with a univariate hazard ratio of 2.49 in the validation set. ROC curves of continuous MS to predict distant metastasis within 5 and 10 years and to predict death in 10 years had AUC of 0.73, 0.70 and 0.68, respectively. The signature provided more compelling prognosis when the tumor grade was low (hazard ratios were 7.58 in the low grade and 4.59 in the high grade tumors). Dai et al, for example, interpreted the change in prognostic power of the classifier as not being reflective of a continuum of patients but instead differential performance in discrete groups of patients.

When compared to risk calculated from Adjuvant!, a web-based decision aid, there was only a modest correlation with MS. In multivariate Cox regression, MS and Adjuvant! risk groups both remain significant prognostic factors when they are adjusted for each other. There were larger differences in survival rates between the high-risk and low-risk groups defined by MS relative to Adjuvant!. MS can better predict those who would not have distant metastases within 10 years than Adjuvant!. These data demonstrate that the molecular signature provides information independent of the prognostic tools that are either routinely used or more recently being adopted for predicting breast cancer distant metastasis.

Example Three

The 14-Gene Signature Predicts Distant Metastasis in Both Treated and Untreated Node-Negative and ER-Positive Breast Cancer Patients Using 96 FFPE Samples

Efforts were undertaken to validate a 14-gene expression signature that can predict distant metastasis in node negative (N−), estrogen receptor positive (ER+) breast cancer patients in an independent sample set having both treated and untreated patients. Reference is made to the experimental protocols and statistical analyses in Example 1, which were used to assay and evaluate the effectiveness of the 14-gene signature.

Patients & Methods

A cohort of 96 N−, ER+ breast cancer patients, with a mean age of 56.7 years, was selected for the validation study (Table 15). The patients in the validation study were selected from University of Muenster. Of those, 15 were untreated, 54 were treated with Tamoxifen alone, 6 were treated with chemotherapy alone, and 2 were treated with both Tamoxifen and chemotherapy. Nineteen patients had unknown treatment status. The fourteen genes in the signature were profiled in FFPE samples using quantitative RT-PCR. A previously derived metastasis score (MS) was calculated for the validation set from the gene expression levels. Patients were stratified into two groups using a pre-determined MS cut point, which was zero.

Validation of Metastasis Score

MS in Equation 1 that was derived with samples in Example 1 was applied to the patients from University of Muenster. In this example, RNA from the tumor tissues was also enriched but to a lesser extent than those in Example 1. To apply MS to this example, conversion factors between enriched and un-enriched samples were obtained from 93 samples from University of Muenster for each of the 14 genes in the signature. The conversion factors between enriched and unenriched samples were also obtained from 93 training samples from CPMC in Example 1. The conversion factors between the gene expression levels from CPMC and University of Muenster were then calculated using those two sets of conversion factors.

Results

Distant-Metastasis-Free Survival Rates

Using MS zero as cut point, patients were classified as high-risk and low-risk. Of all 96 patients, 48 patients were identified as high-risk with a 5-year DMF survival rate (standard error) of 0.61 (0.072) while 48 low-risk patients had a corresponding survival rate of 0.88 (0.052). Of the 62 treated patients, 32 high-risk and 30 low-risk patients had 5-yr DMF survival rates of 0.66 (0.084) and 0.89 (0.060), respectively. Of the 54 patients who received Tamoxifen treatment alone, 26 high-risk and 28 low-risk patients had 5-yr DMF survival rates of 0.65 (0.094) and 0.88 (0.065) respectively (Table 16).

Unadjusted Hazard Ratios

For the entire cohort, the MS correlated with distant metastasis-free (DMF) survival. Cox proportional hazard regression indicated 2.67 (1.28-5.57) times increased hazard per unit increase of MS (p=0.0087). Using zero as cut point, the hazard ratio of high-risk vs. low-risk patients in the entire cohort was 2.65 (1.16-6.06, p=0.021). For 61 treated patients, the hazard ratio was 3.08 (0.99-9.56, p=0.052). For the 54 patients who were treated with Tamoxifen only, the hazard ratio was 2.93 (0.92-9.35, p=0.07). Survival rates and hazard ratios for different groups of patients are summarized in Table 16 and FIG. 7, respectively.

An RT-PCR based 14-gene signature, originally derived from untreated patients, can predict distant metastasis in N−, ER+, Tamoxifen-treated patients in an independent sample set using FFPE tissues. There was a large difference in DMFS rates between high-risk and low-risk patients treated with Tamoxifen alone (0.65 vs. 0.88), where the two groups were defined by MS using zero as cut point. Differential risk between top and bottom quintiles of multi-modal MS was 3.99 and 3.75 fold for all patients and for patients treated with Tamoxifen alone, respectively. The prognostic signature may provide baseline risk that is not confounded with systemic treatment. Moreover, it can predict metastatic risk for patients who receive treatment. Therefore, the gene signature would be applicable in identifying women with a poor clinical outcome to guide treatment decisions, independent of the subsequent therapies.

Example Four

The 14-Gene Signature Predicts Distant Metastasis in Tamoxifen-Treated Node-Negative and ER-Positive Breast Cancer Patients Using 205 FFPE Samples

Patients

A cohort of 205 women with N−, ER+ breast cancer who had surgery between 1975 and 2001 in Guy's Hospital was selected. The median follow-up was 9.3 years. Among them, there were 17 (8.9%) distant metastases, 44 deaths (21.5%) and 17 (8.9%) local and distant recurrences. 138 (67.3%) patients were at stage I while 67 (32.7%) were at stage II. All patients received adjuvant hormonal treatment but no chemotherapy. The cohort had a mean (SD) age of 59.3 (10.4) years. 64% were over the age of 55 years and 80.5% were post-menopausal. All tumors were ≦3 cm in diameter and the mean (SD) tumor diameter was 1.67 cm (1.0). 60 (29.3%), 98 (47.8%) and 47 (22.9%) patients had tumors of histological grade 1, 2 and 3, respectively. (Table 17)

Endpoints

We chose time from surgery to distant metastasis, also referred to as distant-metastasis-free survival (DMFS), as the primary endpoint. Events were distant metastases. Contra-lateral recurrences and deaths without recurrence were censoring events while local recurrences were not considered events or censoring events. The definition of DMFS endpoint, its events and censoring rules were aligned with those adopted by the National Surgical Adjuvant Breast and Bowel Project (NSABP) for the prognostic molecular marker studies (Paik et al 2004). The DMFS endpoint is most directly linked to cancer related death.

Gene Expression Signature and Metastasis Score (MS)

A 14-gene signature was previously developed using profiling study by RT-PCR with FFPE samples from California Pacific Medical Center (CPMC) as described in Example 1. Pathway analyses by the program Ingenuity revealed that the majority of the 14 genes in the signature are involved with cell proliferation. Ten of 14 genes are associated with TP53 signaling pathways that have been found to be coordinately over-expressed in tumors of poor-outcome.

A Metastasis Score (MS) was calculated for each individual. MS in this example was based upon the gene expression of the 14 genes in the signature as previously described. However, the algorithm for calculating MS in this example was different from the algorithm described for the previous examples. Nevertheless, MS derived with the new algorithm was highly correlated with MS derived with the previous method with Pearson correlation coefficient>0.99. Moreover, in this example, two cut points were employed to group patients into high, intermediate and low MS groups as opposed to using only one cut point to categorize patients into low and high risk groups in the previous three examples.

While the 14 genes in the signature were chosen in the study as described in Example 1, the new MS score and cut points were determined based upon the study using untreated samples from Guy's Hospital as described in Example 2. The new MS algorithm was applied and validated in Examples 4 and 5. The new Metastasis Score (MS(new)) is now calculated as the negative of the mean of the gene expression level of 14 genes. With this new score, the fourteen genes were given equal weighting. The −1 multiplier was used so that higher MS corresponds to higher risk of distant metastasis. The new MS can be expressed in the following formula:

$\mathrm{MS(new)} = -\frac{1}{14} \sum_{i=1}^{14} G_i \qquad \text{(Equation 3)}$

where Gi are the centered expression levels of the 14 genes in the signature.
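As an illustration of Equation 3, the following minimal Python sketch computes MS(new) as the negative of the mean of the 14 centered expression levels. The function name `metastasis_score_new` and the example values are hypothetical and are used only to show the arithmetic.

```python
def metastasis_score_new(centered_levels):
    """Equation 3: negative of the mean of the 14 centered expression levels Gi."""
    if len(centered_levels) != 14:
        raise ValueError("expected 14 centered expression levels")
    return -sum(centered_levels) / 14


# Hypothetical centered expression levels, for illustration only
example_levels = [1.2, 0.8, 1.5, 0.9, 1.1, 1.0, 0.7,
                  1.3, 1.4, 0.6, 1.0, 1.2, 0.9, 0.4]
example_ms = metastasis_score_new(example_levels)
print(f"MS(new) = {example_ms:.3f}")
```

Because every gene enters with the same weight of −1/14, a uniform shift in all 14 centered levels moves the score by the same amount in the opposite direction.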

Two cut points of MS(new) were chosen to categorize patients into high, intermediate and low MS groups. The lower cut point was −1.47 while the upper cut point was −0.843. Individuals with MS smaller than −1.47 were in the low MS group. Individuals with MS between −1.47 and −0.843 were in the intermediate MS group while those with MS greater than −0.843 were in the high MS group. If those with low MS were considered low-risk while those in intermediate MS and high MS groups were considered as high risk in Guy's untreated samples (in other words, those with MS above −1.47 were considered high risk), then sensitivity of the MS risk groups would be above 90%. On the other hand, if those with low MS and intermediate MS were considered low-risk while those with high MS were considered high-risk (in other words, those with MS lower than −0.843 were considered low-risk while those with MS higher than −0.843 were considered high-risk), then the sensitivity and specificity of the MS risk groups in Guy's untreated samples would be the same at 70%.
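The grouping by the two cut points can be sketched in Python as follows. The cut point values are those stated above; the function names, and the assignment of a score lying exactly on a cut point to the intermediate group, are assumptions made for illustration because the text does not specify how such ties are handled.

```python
LOWER_CUT = -1.47    # lower cut point of MS(new) stated in the text
UPPER_CUT = -0.843   # upper cut point of MS(new) stated in the text


def ms_group(ms):
    """Assign a low, intermediate or high MS group from MS(new)."""
    if ms < LOWER_CUT:
        return "low"
    if ms <= UPPER_CUT:      # assumption: boundary values count as intermediate
        return "intermediate"
    return "high"


def risk_high_sensitivity(ms):
    """Intermediate and high MS combined as high-risk (MS above the lower cut point)."""
    return "low-risk" if ms_group(ms) == "low" else "high-risk"


def risk_balanced(ms):
    """Low and intermediate MS combined as low-risk (MS not above the upper cut point)."""
    return "high-risk" if ms_group(ms) == "high" else "low-risk"


for score in (-2.0, -1.0, 0.0):
    print(score, ms_group(score), risk_high_sensitivity(score), risk_balanced(score))
```

The two risk functions correspond to the two ways of dichotomizing described above: the first favors sensitivity, the second balances sensitivity and specificity.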

For the untreated samples, the intermediate MS group has risk similar to that of high MS group and the high and intermediate MS groups had higher risk than those with low MS. However, for patients treated with hormonal treatment, the intermediate MS group has risk similar to that of the low MS group. The risk of high MS group is higher than the risk of intermediate MS and low MS groups.

Another method of applying the 14 gene signature is by using Equation 2, as follows. A 14-gene signature was previously developed using profiling study by RT-PCR with FFPE samples from California Pacific Medical Center (CPMC) as described in Example 1. Pathway analyses by the program Ingenuity revealed that the majority of the 14 genes in the signature are involved with cell proliferation. Ten of 14 genes are associated with TP53 signaling pathways that have been found to be coordinately over-expressed in tumors of poor-outcome.

A Metastasis Score (MS) was calculated for each individual. MS in this example was based upon the gene expression of the 14 genes in the signature as previously described. However, the algorithm for calculating MS in this example was based upon Equation 2 in which the fourteen genes were weighted equally. Moreover, in this example, two cut points were employed to group patients into high, intermediate and low MS groups as opposed to using only one cut point to categorize patients into low and high risk groups in the previous three examples. While the 14 genes in the signature were chosen in the study as described in Example 1, the new MS score and cut points were determined based upon the study using untreated samples from Guy's Hospital as described in Example 2. The new MS algorithm was applied and validated in Examples 4 and 5.

Two cut points of MS(new) were chosen to categorize patients into high, intermediate and low MS groups. Cut points were determined such that when individuals with MS above the first cut point of −0.119 were classified as high-risk individuals to have distant metastasis in 5 years, the sensitivity of the MS risk groups would be above 90%. The second cut point of MS=0.302 was chosen such that sensitivity and specificity would be the same at 0.7.

It should be noted that the MS determined in Equation 2 and MS as the negative of the mean of gene expression of all fourteen genes are simply linear transformations of each other. As such they have perfect correlation (Pearson correlation coefficient=1) and the classification of patients into high, intermediate, and low MS is the same using the corresponding cut points as described.

Statistical Analyses

Kaplan-Meier (KM) curves for distant metastasis free survival were generated for the high, intermediate and low MS groups. Upon examining the DMFS rates of the three groups, intermediate and low MS groups were combined as a low-risk group which was compared with the high risk group with high MS. Log rank tests were performed.

Univariate and multivariate Cox proportional hazard regression analyses of MS groups for DMFS endpoint were performed. Hazard ratio of high-risk (high MS) vs. low-risk (intermediate and low MS groups combined) group was adjusted for age (in years), tumor size (in cm) and histological grade in multivariate analysis.

Association of MS groups with age and tumor size was investigated using ANOVA tests while association of MS groups with histological grade was evaluated by Cramér's V for strength of association and by chi-square tests for statistical significance.

Hazard ratios of the MS risk groups were also calculated for different clinical subgroups. Those groups include pre-menopausal vs. post-menopausal, age≦55 years vs. >55 years, tumor size>2 cm vs. ≦2 cm, and histological grade 1 & 2 vs. grade 3.

To assess diagnostic accuracy, Receiver Operator Characteristic (ROC) curve of MS to predict distant metastases within 5 years was plotted. Area under the ROC curves (AUC) was calculated. Sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) were calculated with 95% confidence interval (95% CI) for high vs. low-risk groups by MS.

The time-dependence of hazard ratio of MS groups was investigated by estimating the annualized hazards using a spline-curve fitting technique that can handle censored data. The HEFT procedure in R 2.4.1 was employed. Annualized hazards were estimated for both MS high and low-risk groups, from which the hazard ratios at different times were calculated.

Kaplan Meier estimates and Cox proportional hazard regression were performed using R 2.4.1 and SAS 9.1. ROC curves and AUC were estimated using the Mayo Clinic's ROC program. The DeLong method of estimating confidence interval of AUC was employed.

Results

Distant-Metastasis-Free Survival Rates in MS Low-Risk and High-Risk Groups

There were 8 distant metastases in the low MS group of 136 individuals, 2 distant metastases in the intermediate MS group of 29 individuals and 7 distant metastases in the high MS group of 40 individuals. The 10-year DMFS rates (SE) were 0.921 (0.028), 0.966 (0.034) and 0.804 (0.068) for low, intermediate, and high MS groups, respectively. There were significant differences in DMFS rates with a log-rank p-value of 0.04. As DMFS rates were similar in low and intermediate MS groups, they were combined to form the low-risk group. The low-risk group had a 10-year DMFS rate of 0.928 (0.025) and was significantly different from the corresponding rate of 0.804 (0.068) for the high-risk group (Table 18). The log-rank p-value was 0.011. Kaplan-Meier plots of distant-metastasis-free survival for the three MS groups and the two MS risk groups were in FIG. 8 and FIG. 9, respectively.

Hazard Ratios from Univariate and Multivariate Cox Regression Models

The unadjusted hazard ratio of the MS high-risk vs. low-risk groups to predict time to distant metastases was 3.25 (95% CI=1.24 to 8.54, p-value=0.017). When adjusted by age, tumor size and histological grade, the hazard ratio was 5.82 (1.71-19.75, p=0.0047) in Table 19. MS risk group was the only risk factor that was significant in the multivariate analyses. Therefore, the gene signature has independent prognostic value for DMFS over the traditional clinicopathological risk factors and captures part of the information of these factors.

Association of MS Risk Groups with Other Clinical and Pathological Characteristics

MS risk group had very significant association with histological grade (Cramér's V=0.65, p<0.0001 for chi-square test for association). Of the 60 grade 1 tumors, none (0%) was MS high-risk. In contrast, 9 (9.2%) of 98 grade 2 and 31 (66%) of 47 grade 3 tumors were MS high-risk.
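The strength of association reported above can be checked from the counts in this paragraph. The following Python sketch builds the 3×2 table of histological grade by MS risk group (high-risk counts of 0, 9 and 31 out of 60, 98 and 47 tumors) and computes the Pearson chi-square statistic and Cramér's V; the function name `cramers_v` is hypothetical, and the p-value is not computed here.

```python
import math


def cramers_v(table):
    """Return (Cramér's V, Pearson chi-square) for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k)), chi2


# Rows: grade 1, 2, 3; columns: MS high-risk, MS low-risk (counts from the text)
grade_by_risk = [
    [0, 60],
    [9, 98 - 9],
    [31, 47 - 31],
]
v, chi2 = cramers_v(grade_by_risk)
print(f"chi-square = {chi2:.1f}, Cramér's V = {v:.2f}")
```

With these counts the computed Cramér's V rounds to the 0.65 stated in the text.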

While tumor size was larger in the MS high-risk group, the difference was not statistically significant (1.87 cm and 1.61 cm in MS high and low-risk groups respectively, ANOVA p-value=0.14). There was no significant association of MS risk groups with age (Mean age is 59.3 in both high and low-risk groups, ANOVA p=0.34). Results were summarized in Table 20.

Performance of the Molecular Signature in Different Clinical Subgroups

Hazard ratio of MS risk groups was 3.7 (1.1-12.6) in tumors≦2 cm and 3.0 (0.6-14.8) in tumors>2 cm. In women younger than 55 years, HR was 7.4 (1.0-54.7) while it was 2.8 (0.88-8.8) in women older than 55 years. For tumors of histological grade 1 and 2, HR was 12.8 (3.8-43.1) while HR was 1.53 (0.16-14.7) in grade 3 tumors (Table 21).

Diagnostic Accuracy and Predictive Values

Sensitivity, specificity, PPV and NPV of MS risk groups to predict distant metastasis in 5 years were shown in Table 22. Sensitivity of MS risk group was 0.50 (0.50-0.76) while specificity was 0.82 (0.76-0.87). Using the estimated 5-year distant metastasis rate of 0.05, PPV and NPV were estimated to be 0.13 (0.068-0.23) and 0.97 (0.94-0.98) respectively (Table 22). High NPV of the MS risk group was important for it to be used for ruling out more aggressive treatment such as chemotherapy for patients with low-risk.
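The predictive values above follow from sensitivity, specificity and the assumed event rate by Bayes' rule. A minimal Python sketch, using the point estimates stated in this paragraph (sensitivity 0.50, specificity 0.82, 5-year distant metastasis rate 0.05), is given below; the function name `predictive_values` is hypothetical, and the confidence intervals are not reproduced.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from sensitivity, specificity and prevalence via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Point estimates stated in the text for the MS risk groups
ppv, npv = predictive_values(sensitivity=0.50, specificity=0.82, prevalence=0.05)
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")
```

With these inputs the PPV and NPV round to the 0.13 and 0.97 stated in the text. At a low event rate the NPV stays high even with moderate sensitivity, which is why the low-risk call is the more informative one here.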

ROC curve of continuous MS to predict distant metastasis within 5 years was shown in FIG. 10 and AUC was estimated to be 0.72 (0.57-0.87).

Time Dependence of the Prognostic Signature

Annualized hazard rates for MS high and low-risk groups were shown in FIG. 11a while the time-dependence of the hazard ratio between two groups was shown in FIG. 11b. For the high-risk group, annual hazard rate peaked at 2.5% around year 3 from surgery and then slowly decreased over the next few years. However, the annualized hazard rate in the low-risk group showed a slight but steady increase over the 10-year follow-up period. Consequently, the hazard ratio of MS risk groups was time dependent. It was 4.6, 3.6 and 2.1 at years 2, 5 and 10, respectively.

Example Five

The 14-Gene Signature Predicts Distant Metastasis in Adjuvant Hormonally-Treated Node-Negative and ER-Positive Breast Cancer Patients Using 234 FFPE Samples

Patients

A cohort of 234 Japanese women with N−, ER+ breast cancer who had surgery between 1995 and 2003 in Aichi Cancer Center was selected. The median follow-up was 8.7 years. Among them, there were 31 (13%) distant metastases, 19 deaths (8.1%) and 46 (19.7%) local and distant recurrences.

146 (62%) patients were at stage I while 88 (38%) were at stage II. All patients received adjuvant hormonal treatment but no chemotherapy. 112 post-menopausal women were treated with Tamoxifen. Of 122 pre-menopausal women, 102 received Tamoxifen while 20 received Zoladex treatment. The cohort had a mean (SD) age of 53 (11) years and mean (SD) tumor diameter of 2.05 cm (1.1). 74 (32%), 113 (48%) and 47 (20%) patients had tumors of histological grade 1, 2 and 3, respectively. (Table 23)

Gene Expression Signature and Metastasis Score (MS)

A 14-gene expression signature was previously developed and validated in profiling studies in the US and Europe using RT-PCR with FFPE samples. Pathway analyses by the program Ingenuity revealed that the majority of the 14 genes in the signature are involved with cell proliferation. Ten of 14 genes are associated with TP53 signaling pathways that have been found to be coordinately over-expressed in tumors of poor-outcome.

A Metastasis Score (MS) was calculated for each individual. MS was based upon the negative of the mean of the gene expression levels (in ΔΔCt) of the 14 genes in the signature. Moreover, two cut points had previously been determined from a study with tumor samples from untreated patients from Guy's Hospital (Example 2) to group patients into high, intermediate and low MS groups.

Statistical Analyses

Kaplan-Meier (KM) curves for distant metastasis free and overall survival were generated for the high, intermediate and low MS groups. Upon examining the DMFS rates of the three groups, intermediate and low MS groups were combined as a low-risk group which was compared with the high risk group with high MS. Log rank tests were performed.

Univariate and multivariate Cox proportional hazard regression analyses of MS groups for DMFS and OS endpoints were performed. The hazard ratio of the high-risk (high MS) vs. low-risk (intermediate and low MS groups combined) group was adjusted for age (in years), tumor size (in cm) and histological grade in one multivariate analysis. In another multivariate analysis, it was adjusted for menopausal status, treatment, tumor size, histological grade and PgR status.

Association of MS groups with age and tumor size was investigated using ANOVA tests, while association of MS groups with histological grade and tumor subtypes was evaluated by Cramér's V for strength of association and by chi-square tests for statistical significance.

Hazard ratios of the MS risk groups were also calculated for different clinical subgroups. Those subgroups include pre-menopausal vs. post-menopausal, age ≦55 years vs. >55 years, tumor size >2 cm vs. ≦2 cm, histological grade 1 & 2 vs. grade 3, and PgR +ve vs. −ve.

To assess diagnostic accuracy, a Receiver Operating Characteristic (ROC) curve of MS to predict distant metastases within 5 years was plotted. The area under the ROC curve (AUC) was calculated. Sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) were calculated with 95% confidence intervals (95% CI) for high vs. low-risk groups by MS.

The time-dependence of the hazard ratio of MS groups was investigated by estimating the annualized hazards using a spline-curve fitting technique that can handle censored data. The HEFT procedure in R 2.4.1 was employed. Annualized hazards were estimated for both the MS high-risk and low-risk groups, from which the hazard ratios at different times were calculated.

Kaplan-Meier estimates and Cox proportional hazard regression were performed using R 2.4.1 and SAS 9.1. ROC curves and AUC were estimated using the Mayo Clinic's ROC program. The DeLong method of estimating the confidence interval of the AUC was employed.
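As a hedged illustration of the AUC referred to above, the sketch below computes the area under the ROC curve through its equivalence with the Mann-Whitney statistic, namely the fraction of (metastasis, no-metastasis) pairs in which the patient with metastasis has the higher score, with ties counted as one half. It is not the Mayo Clinic ROC program; the function name, scores and labels are hypothetical, and patients censored before 5 years are assumed to have been excluded beforehand.

```python
def roc_auc(scores, labels):
    """AUC of a continuous score for a binary outcome (1 = event, 0 = no event)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both outcome classes are required")
    concordant = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5  # ties count as half
    return concordant / (len(positives) * len(negatives))


# Hypothetical MS values and 5-year outcomes (not patient data).
example_scores = [0.9, 0.4, 0.3, -0.2, -0.5, 0.1]
example_labels = [1, 1, 0, 0, 0, 1]
example_auc = roc_auc(example_scores, example_labels)
```

The pairwise form is quadratic in the number of patients, which is of no concern for a cohort of a few hundred samples.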

Results

Distant-Metastasis-Free Survival Rates in MS Low-Risk and High-Risk Groups

There were 6 distant metastases in the low MS group of 77 individuals, 4 distant metastases in the intermediate MS group of 62 individuals and 21 distant metastases in the high MS group of 95 individuals. The 10-year DMFS rates (SE) were 0.89 (0.05), 0.91 (0.04) and 0.75 (0.05) for the low, intermediate and high MS groups, respectively. There was a significant difference in DMFS rates, with a log-rank p-value of 0.004. As DMFS rates were similar in the low and intermediate MS groups, they were combined to form the low-risk group. The low-risk group had a 10-year DMFS rate of 0.895 (0.034), which is significantly different from the corresponding rate of 0.75 (0.05) for the high-risk group (Table 24). The log-rank p-value is 0.00092. Kaplan-Meier plots of distant-metastasis-free survival for the three MS groups are shown in FIG. 12, while Kaplan-Meier plots for the two risk groups (high MS and a combination of intermediate and low MS) are shown in FIG. 13.

Hazard Ratios from Univariate and Multivariate Cox Regression Models

The unadjusted hazard ratio of the MS high-risk vs. low-risk groups to predict time to distant metastases was 3.32 (95% CI=1.56 to 7.06, p-value=0.0018). When adjusted by age, tumor size and histological grade, the hazard ratio was 3.79 (1.42-10.1, p=0.0078). Besides MS, tumor size was the only other significant factor in the multivariate analysis, with an HR of 1.4 per cm increase (p=0.007) (Table 25). In another multivariate analysis, MS risk groups were adjusted by menopausal status, treatment, tumor size, histological grade and PgR status. The adjusted hazard ratio of MS risk group was 3.44 (1.27-9.34) (Table 27). Again, tumor size was the only other significant factor in this multivariate analysis (HR=1.45 per cm increase, p=0.0049). Therefore, the gene signature has independent prognostic value for DMFS over the traditional clinicopathological risk factors and captures part of the information of these factors.

Association of MS Risk Groups with Other Clinical and Pathological Factors

MS risk group had a very significant association with histological grade (Cramér's V=0.54, p<0.0001 for chi-square test for association). Of 74 grade 1 tumors, only 4 (5.4%) were MS high-risk. In contrast, 54 (47.8%) of 113 grade 2 and 37 (78.7%) of 47 grade 3 tumors were MS high-risk.
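The strength of association reported above can be checked with a short calculation. The sketch below computes Cramér's V from a contingency table as the square root of chi-square divided by n times (the smaller table dimension minus one); the function name is hypothetical, and the counts are those of MS risk group by histological grade stated in this example (high-risk counts from the text, the remainder by subtraction from the grade totals).

```python
from math import sqrt


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi_square += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return sqrt(chi_square / (n * k))


# Rows: MS high-risk, MS low-risk (int. + low); columns: grade 1, 2, 3.
grade_by_risk = [
    [4, 54, 37],
    [70, 59, 10],
]
v_grade = cramers_v(grade_by_risk)
```

With these counts the value rounds to 0.54, in agreement with the figure given in the text.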

MS risk groups were also associated with tumor subtypes (Cramér's V=0.25, p=0.02). While 23 (29.8%) of 77 scirrhous tumors were MS high-risk, 45 (41.7%) of 108 papillotubular tumors and 24 (63.2%) of 38 solid-tubular tumors were MS high-risk.

While tumor size was larger in the MS high-risk group (2.23 cm and 1.93 cm in the MS high-risk and low-risk groups, respectively; ANOVA p-value=0.037), there was no significant association of MS with age (p=0.29). Results are summarized in Table 26.

Performance of the Molecular Signature in Different Clinical Subgroups

MS risk group can best predict distant metastases in young (age ≦55 years), pre-menopausal women with tumors that were ≦2 cm, low grade (grade 1 and 2) and PgR +ve. The hazard ratio of MS risk groups was 4.5 (1.2-17.3) in tumors ≦2 cm and 2.3 (0.92-5.6) in tumors >2 cm. In pre-menopausal women, HR was 6.0 (1.6-23.3) while it was 2.1 (0.83-5.1) in post-menopausal women. For those with tumors of histological grade 1 and 2, HR was 3.6 (1.5-8.4) while HR was 2.4 (0.29-18.8) in grade 3 tumors. HR of MS risk group was 3.5 (1.4-9.0) in PgR +ve tumors while it was 2.1 (0.57-7.49) in PgR −ve tumors (Table 28).

Diagnostic Accuracy and Predictive Values

Sensitivity, specificity, PPV and NPV of MS risk groups to predict distant metastasis within 5 years are shown in Table 29. Sensitivity of MS risk group was 0.81 (0.60-0.92) while the specificity was 0.65 (0.58-0.71). Using the estimated 5-year distant metastasis rate of 0.095, PPV and NPV were estimated to be 0.19 (0.15-0.24) and 0.97 (0.93-0.99), respectively (Table 29). The high NPV of the MS risk group is important for its use in ruling out more aggressive treatment, such as chemotherapy, for low-risk patients. The ROC curve of continuous MS to predict distant metastasis within 5 years is shown in FIG. 14, and the AUC was estimated to be 0.73 (0.63-0.84).
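The dependence of PPV and NPV on the assumed metastasis rate follows from Bayes' rule and may be sketched as below. The function name is hypothetical; the inputs are the rounded sensitivity, specificity and 5-year distant metastasis rate quoted above, so the output is expected to agree with the reported predictive values only to within rounding.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given event prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Rounded values quoted in the text for the Japanese cohort.
ppv_5yr, npv_5yr = predictive_values(
    sensitivity=0.81, specificity=0.65, prevalence=0.095
)
```

Because the event rate is low, the NPV remains high even at moderate specificity, which is the property relied upon when the low-risk classification is used to rule out more aggressive treatment.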

Time Dependence of the Prognostic Signature

Annualized hazard rates for the MS high-risk and low-risk groups are shown in FIG. 15a, while the time-dependence of the hazard ratio between the two groups is shown in FIG. 15b. For the high-risk group, the annual hazard rate peaked at 3.5% around year 3 from surgery and then slowly decreased over the next few years. However, the annualized hazard rate in the low-risk group showed a slight but steady increase over the 10-year period for which follow-up was available. Consequently, the hazard ratio of MS risk groups was time dependent. It was 4.8, 3.4 and 1.8 at years 2, 5 and 10, respectively.

As seen from this example and the previous examples, the 14-gene signature is shown to be an effective risk predictor in breast cancer patients of both Caucasian and Asian ethnic background, indicating the robustness of the 14-gene prognostic signature.

All publications and patents cited in this specification are herein incorporated by reference in their entirety. Various modifications and variations of the described compositions, methods and systems of the invention will be apparent to those skilled in the art without departing from the scope and spirit of the invention. Although the invention has been described in connection with specific preferred embodiments and certain working examples, it should be understood that the invention as claimed should not be unduly limited to such specific embodiments. Indeed, various modifications of the above-described modes for carrying out the invention that are obvious to those skilled in the field of molecular biology, genetics and related fields are intended to be within the scope of the following claims.

TABLE 1
Clinical and pathological characteristics of patients from CPMC and Guy's Hospital

Clinical and pathological characteristics of node-negative, ER-positive, untreated patients

Characteristics         Training (UCSF)     Validation (Guy's)
                        n = 142             n = 280
Age
  ≦55 yrs               40 (28.2%)          144 (51.4%)
  >55 yrs               102 (71.8%)         136 (48.6%)
  Mean (Std. dev.)      62 yrs (12.6)       55.5 yrs (11.6)
  Min.-Max.             31 yrs-89 yrs       29 yrs-87 yrs
Tumor diameter
  ≦2 cm                 126 (88.7%)         167 (59.6%)
  >2 cm                 8 (5.6%)            113 (40.4%)
  Missing               8 (5.6%)            0 (0%)
  Mean (Std. Dev.)      1.28 cm (0.50)      1.93 cm (0.85)
Tumor grade
  Grade 1               74 (52.1%)          60 (21.4%)
  Grade 2               61 (43%)            166 (59.3%)
  Grade 3               4 (2.8%)            54 (19.3%)
  Missing               3 (2.1%)            0 (0%)
Stage
  I                     117 (82.4%)         167 (59.6%)
  IIA                   25 (17.6%)          113 (40.4%)
Distant Recurrence
  Yes                   31 (21.8%)          66 (23.4%)
  No                    111 (78.2%)         214 (76.6%)
Death of all cause
  Yes                   56 (39.4%)          135 (48.2%)
  No                    86 (60.6%)          145 (51.8%)
Median follow up        8.7 yrs             15.6 yrs

TABLE 2
Genes comprising the 14-gene metastasis prognostic panel and endogenous controls.

Gene      MS constant ai   Ref Seq     Description
            Reference

CENPA     0.29             NM_001809   centromere protein A, 17 kDa
            Black, B. E., Foltz, D. R., et al., Nature 430 (6999): 578-582 (2004)
PKMYT1    0.29             NM_004203   membrane-associated tyrosine- and threonine-specific cdc2-inhibitory kinase
            Bryan, B. A., Dyson, O. F. et al., J. Gen. Virol. 87 (PT 3), 519-529 (2006)
MELK      0.29             NM_014791   maternal embryonic leucine zipper kinase
            Beullens, M., Vancauwenbergh, S. et al., J. Biol. Chem. 280 (48), 40003-40011 (2005)
MYBL2     0.29             NM_002466   v-myb myeloblastosis viral oncogene homolog (avian)-like 2
            Bryan, B. A., Dyson, O. F. et al., J. Gen. Virol. 87 (PT 3), 519-529 (2006)
BUB1      0.27             NM_004366   BUB1 budding uninhibited by benzimidazoles 1 homolog
            Morrow, C. J., Tighe, A. et al., J. Cell. Sci. 118 (PT 16), 3639-3652 (2005)
RACGAP1   0.29             NM_013277   Rac GTPase activating protein 1
            Niiya, F., Xie, X. et al., J. Biol. Chem. 280 (43), 36502-36509 (2005)
TK1       0.27             NM_003258   thymidine kinase 1, soluble
            Karbownik, M., Brzezianska, E. et al., Cancer Lett. 225 (2), 267-273 (2005)
UBE2S     0.27             NM_014501   ubiquitin-conjugating enzyme E2S
            Liu, Z., Diaz, L. A. et al., J. Biol. Chem. 267 (22), 15829-15835 (1992)
DC13      0.22             AF201935    DC13 protein
            Gu, Y., Peng, Y. et al., Direct Submission, Submitted (5 Nov. 1999) Chinese National Human Genome Center at Shanghai, 351 Guo Shoujing Road, Zhangjiang Hi-Tech Park, Pudong, Shanghai 201203, P. R. China
RFC4      0.25             NM_002916   replication factor C (activator 1) 4, 37 kDa
            Gupte, R. S., Weng, Y. et al., Cell Cycle 4 (2), 323-329 (2005)
PRR11     0.26             NM_018304   proline rich 11 (FLJ11029)
            Weinmann, A. S., Yan, P. S. et al., Genes Dev. 16 (2), 235-244 (2002)
DIAPH3    0.23             NM_030932   diaphanous homolog 3 (Drosophila)
            Katoh, M. and Katoh, M., Int. J. Mol. Med. 13 (3), 473-478 (2004)
ORC6L     0.28             NM_014321   origin recognition complex, subunit 6 homolog-like (yeast)
            Sibani, S., Price, G. B. et al., Biochemistry 44 (21), 7885-7896 (2005)
CCNB1     0.23             NM_031966   cyclin B1
            Zhao, M., Kim, Y. T. et al., Exp Oncol 28 (1), 44-48 (2006)
PPIG      EC               NM_004792   peptidylprolyl isomerase G
            Lin, C. L., Leu, S. et al., Biochem. Biophys. Res. Commun. 321 (3), 638-647 (2004)
NUP214    EC               NM_005085   nucleoporin 214 kDa
            Graux, C., Cools, J. et al., Nat. Genet. 36 (10), 1084-1089 (2004)
SLU7      EC               NM_006425   step II splicing factor
            Shomron, N., Alberstein, M. et al., J. Cell. Sci. 118 (PT 6), 1151-1159 (2005)

NOTE: PCR primers for expression profiling of all genes disclosed herein were designed to amplify all transcript variants known at time of filing. EC = Endogenous Control. Ref Seq = NCBI reference sequence for one variant of this gene.

TABLE 3
Primers used in gene expression profiling.

Gene      Sequence                     SEQ ID          Orientation
BUB1      CATGGTGGTGCCTTCAA            SEQ ID NO: 1    Upper
CCNB1     GCCAAATACCTGATGGAACTAA       SEQ ID NO: 3    Upper
CENPA     CAGTCGGCGGAGACAA             SEQ ID NO: 5    Upper
DC13      AAAGTGACCTGTGAGAGATTGAA      SEQ ID NO: 7    Upper
DIAPH3    TTATCCCATCGCCTTGAA           SEQ ID NO: 9    Upper
MELK      AGAGACGGGCCCAGAA             SEQ ID NO: 11   Upper
MYBL2     GCGGAGCCCCATCAA              SEQ ID NO: 13   Upper
ORC6L     CACTTCTGCTGCACTGCTTT         SEQ ID NO: 15   Upper
PKMYT1    CTACCTGCCCCCTGAGTT           SEQ ID NO: 17   Upper
PRR11     TGTCCAAGCTGTGGTCAAA          SEQ ID NO: 19   Upper
RACGAP1   GACTGCGAAAAGCTGGAA           SEQ ID NO: 21   Upper
RFC4      TTTGGCAGCAGCTAGAGAA          SEQ ID NO: 23   Upper
TK1       GATGGTTTCCACAGGAACAA         SEQ ID NO: 25   Upper
UBE2S     CCTGCTGATCCACCCTAA           SEQ ID NO: 27   Upper
NUP214    CACTGGATCCCAAGAGTGAA         SEQ ID NO: 29   Upper
PPIG      TGGACAAGTAATCTCTGGTCAA       SEQ ID NO: 31   Upper
SLU7      TGCCAATGCAGGAAAGAA           SEQ ID NO: 33   Upper
BUB1      GCTGAATACATGTGAGACGACAA      SEQ ID NO: 2    Lower
CCNB1     CTCCTGCTGCAATTTGAGAA         SEQ ID NO: 4    Lower
CENPA     AAGAGGTGTGTGCTCTTCTGAA       SEQ ID NO: 6    Lower
DC13      CGCCCTGCCCAACAA              SEQ ID NO: 8    Lower
DIAPH3    TGCTCCACACCATGTTGTAA         SEQ ID NO: 10   Lower
MELK      CAACAGTTGATCTGGATTCACTAA     SEQ ID NO: 12   Lower
MYBL2     CATCCTCATCCACAATGTCAA        SEQ ID NO: 14   Lower
ORC6L     GGATGTGGCTACCATTTTGTTT       SEQ ID NO: 16   Lower
PKMYT1    AGCATCATGACAAGGACAGAA        SEQ ID NO: 18   Lower
PRR11     TCTCCAGGGGTGATCAGAA          SEQ ID NO: 20   Lower
RACGAP1   TTGCTCCTCGCTTAGTTGAA         SEQ ID NO: 22   Lower
RFC4      CACGTTCATCAGATGCATTTAA       SEQ ID NO: 24   Lower
TK1       GGATCCAAGTCCCAGCAA           SEQ ID NO: 26   Lower
UBE2S     GCATACTCCTCGTAGTTCTCCAA      SEQ ID NO: 28   Lower
NUP214    TGATCCCACTCCAAGTCTAGAA       SEQ ID NO: 30   Lower
PPIG      GTATCCGTACCTCCGCAAA          SEQ ID NO: 32   Lower
SLU7      TGGTATCTCCTGTGTACCTAACAAA    SEQ ID NO: 34   Lower

TABLE 4
Mean (μ) and standard deviation (σ) of gene expression levels of the fourteen genes in the signature in 142 samples from CPMC (ai are the loadings on the first principal component)

Gene      ai      μ       σ
CENPA     0.29    −0.18   1.40
TK1       0.27    −1.07   1.68
BUB1      0.27    2.69    1.78
PRR11     0.26    5.25    1.90
UBE2S     0.27    2.64    1.41
DC13      0.22    0.47    1.15
DIAPH3    0.23    1.90    1.34
MELK      0.29    −1.77   1.41
MYBL2     0.29    2.54    1.88
PKMYT1    0.29    −0.35   1.73
RFC4      0.25    1.25    0.99
ORC6L     0.28    1.98    1.43
RACGAP1   0.29    4.00    1.09
CCNB1     0.23    3.20    1.66

TABLE 5
Distant metastasis-free and overall survival rates for low-risk and high-risk prognosis group at 5 and 10 years for patients from CPMC

                        Metastasis-free survival rate     Overall survival rate
            No. of      5-yr            10-yr             5-yr            10-yr
Group       patients    (std. error)    (std. error)      (std. error)    (std. error)
Low risk    71          0.96 (0.025)    0.90 (0.037)      0.90 (0.036)    0.78 (0.059)
High risk   71          0.74 (0.053)    0.62 (0.066)      0.79 (0.049)    0.48 (0.070)
All         142         0.85 (0.031)    0.76 (0.040)      0.84 (0.031)    0.63 (0.048)

TABLE 6
Univariate and multivariate Cox proportional hazard regression analyses of 14-gene prognostic signature, tumor size, tumor grade and age for patients from CPMC

                                  Univariate analysis             Multivariate analysis
Variable                          Hazard ratio (95% CI)  P Value  Hazard ratio (95% CI)  P Value
14-gene signature                 4.23 (1.82-9.85)       0.0008   3.26 (1.26-8.38)       0.014
Age                               1.00 (0.97-1.03)       0.850    1.00 (0.98-1.03)       0.810
Tumor size                        1.07 (1.00-1.14)       0.050    1.03 (0.96-1.10)       0.450
Tumor grade (moderate + high)     2.18 (1.04-4.59)       0.040    1.26 (0.55-2.87)       0.580

For 14-gene prognostic signature, hazard ratio compares high-risk vs. low-risk groups using median MS (CV) to classify patients. For age, hazard ratio is given as the hazard increase for each year increase in age. For tumor size, hazard ratio is given as hazard increase per each centimeter increase in diameter. For tumor grade, hazard of the group with medium and high-grade tumors vs. low-grade tumors.

TABLE 7

Gene      M value
PPIG      0.8046
SLU7      0.8741
NUP214    0.8886
PPP1CA    1.0256
TERF2     1.1907
EEF1A1    1.1994

TABLE 8

Gene      M value
PPIG      0.5697
NUP214    0.5520
SLU7      0.6075

TABLE 9
Prognostic values of the molecular prognostic signature using median MS (CV) as cut point to predict distant metastasis within 5 years and metastasis-free in more than 5 years for patients from CPMC

(A). Distant metastasis within 5 years - prognosis vs. actual outcome

                    Distant Metastasis    Disease-free
                    Within 5 yr           >5 yr
Group               (n = 21)              (n = 95)
High risk by MS     18                    39
Low risk by MS      3                     56

(B). Diagnostic metrics of prognosis signature to predict distant metastasis within 5 years

                Value    95% CI
Sensitivity     0.86     0.65-0.95
Specificity     0.59     0.49-0.63
Odds ratio      8.62     2.37-31.26
PLR             2.09     1.55-2.81
NLR             0.24     0.084-0.70
PPV*            0.26     0.21-0.33
NPV*            0.96     0.89-0.99

PLR = positive likelihood ratio, NLR = negative likelihood ratio, PPV = positive predictive value, NPV = negative predictive value
*Predictive values were calculated with prevalence of distant metastasis at 5 years estimated to be 0.15 from the current data.
**Individuals with distant metastasis in more than 5 years and those censored before 5 years were not included.

TABLE 10
Hazard ratios (unadjusted) (with 95% confidence interval) and p values for various risk classification for untreated patients from Guy's Hospital

Classification                               Time to distant metastases       Overall Survival
Gene signature (high risk vs. low risk)      6.12 (2.23-16.83), p = 0.0004    2.49 (1.50-4.14), p = 0.0004
Adjuvant! (high risk vs. low risk)           2.63 (1.30-5.32), p = 0.007      2.89 (1.71-4.87), p < 0.0001
Age (≦55 yr vs. >55 yr)                      1.45 (0.89-2.36), p = 0.13       2.91 (2.03-4.18), p < 0.0001
Tumor size (T2 vs. T1)                       2.27 (1.39-3.69), p = 0.001      1.60 (1.14-2.25), p = 0.0062
Histologic grade (grade 2 + 3 vs. grade 1)   2.56 (1.17-5.61), p = 0.019      2.56 (1.44-4.54), p = 0.0013

TABLE 11
Univariate and multivariate Cox model of time to distant metastases for 14-gene prognostic signature, tumor size, tumor grade and age for untreated patients from Guy's Hospital

Table 11a
Univariate and multivariate Cox model of time to distant metastases (DMFS) for 14-gene prognostic signature, tumor size, tumor grade and age

                     Univariate analysis               Multivariate analysis
Variable             Hazard ratio (95% CI)   p-value   Hazard ratio (95% CI)   p-value
14-gene signature    6.12 (2.23-16.83)       0.0004    4.81 (1.71-13.53)       0.003
Age                  1.03 (1.00-1.05)        0.024     1.01 (0.99-1.04)        0.251
Tumor size           1.74 (1.24-2.44)        0.001     1.39 (0.97-2.00)        0.076
Grade 2              2.45 (1.10-5.43)        0.028     1.43 (0.63-3.23)        0.390
Grade 3              2.98 (1.21-7.30)        0.017     1.40 (0.55-3.53)        0.478

For 14-gene prognostic signature, hazard ratio compares high-risk vs. low-risk groups using formerly defined zero MS as cut point to classify patients. For age, hazard ratio is given as the hazard increase for each year increase in age. For tumor size, hazard ratio is given as hazard increase per each centimeter increase in diameter. For tumor grade, hazards of the groups with grade 2 and grade 3 tumors were compared to grade 1 tumors.

Table 11b
Univariate and multivariate Cox model of time to distant metastases (DMFS) for risk groups by MS and by Adjuvant!

                     Univariate analysis               Multivariate analysis
Variable             Hazard ratio (95% CI)   p-value   Hazard ratio (95% CI)   p-value
14-gene signature    6.12 (2.23-16.83)       0.0004    5.32 (1.92-14.73)       0.001
Adjuvant!            2.63 (1.30-5.32)        0.007     2.06 (1.02-4.19)        0.045

For 14-gene prognostic signature, hazard ratio compares high-risk vs. low-risk groups using formerly defined zero MS as cut point to classify patients. For Adjuvant!, hazard ratio compares high-risk vs. low-risk groups using cut point of 20% relapse probability in 10 years as calculated by Adjuvant! Online program.

TABLE 12
Distant metastasis free and overall survival rates for low-risk and high-risk groups by MS at 5 and 10 years for untreated patients from Guy's Hospital

Distant metastasis-free and overall survival rates for low-risk and high-risk prognosis group by MS at 5 and 10 years

            No. of      Metastasis-free survival rate            Overall survival rate
Group       patients    5-yr (std. error)   10-yr (std. error)   5-yr (std. error)   10-yr (std. error)
Low risk    71          0.99 (0.014)        0.96 (0.025)         0.97 (0.020)        0.94 (0.028)
High risk   209         0.86 (0.025)        0.76 (0.031)         0.92 (0.019)        0.71 (0.032)
All         280         0.89 (0.019)        0.81 (0.024)         0.93 (0.015)        0.77 (0.026)

TABLE 13
Subgroup analyses: hazard ratio of MS risk groups for time to distant metastases (DMFS) in different subgroups of Adjuvant!, histological grade, tumor size, age and menopausal status for untreated patients from Guy's Hospital

                                                               Hazard ratio of MS risk
                              no. of patients   no. of events  group (95% CI)          p-value
Adjuvant!
  High risk                   205               57             3.72 (1.34-10.28)       0.0114
  Low risk                    75                9              Infinity*               0.01
Tumor grade
  High grade (grade 2, 3)     220               59             4.59 (1.44-14.67)       0.010
  Low grade (grade 1)         60                7              7.58 (0.91-63.08)       0.061
Tumor size
  ≦2 cm                       167               28             14.16 (1.92-104.22)     0.009
  >2 cm                       113               38             2.64 (0.81-8.58)        0.110
Age
  ≦55 yrs                     144               30             13.58 (1.85-99.74)      0.010
  >55 yrs                     136               36             3.45 (1.06-11.26)       0.040
Menopausal status
  Premenopausal               102               21             7.85 (1.05-58.49)       0.044
  Postmenopausal              157               40             4.88 (1.51-15.84)       0.008

*34 MS low-risk with 0 events and 41 MS high-risk with 9 events

TABLE 14
Diagnostic accuracy and predictive values of MS and Adjuvant! risk groups for distant metastases within 10 years for untreated patients from Guy's Hospital

Distant Metastases within 10 years

                        Sensitivity         Specificity         PPV                 NPV
Risk Classification     (95% CI)            (95% CI)            (95% CI)            (95% CI)
MS risk group           0.94 (0.84-0.98)    0.3 (0.24-0.37)     0.23 (0.21-0.25)    0.97 (0.88-0.99)
Adjuvant! risk group    0.90 (0.78-0.96)    0.31 (0.26-0.38)    0.23 (0.20-0.25)    0.93 (0.85-0.97)

TABLE 15
Clinical and Pathological Characteristics of both Untreated and Treated Patients from University of Muenster

                                                      Tamoxifen
                       All Patients    All Treated    Treated Only
Characteristics        N = 96          N = 62         N = 54
Age
  Mean (SD)            56.64 (12.6)    56.43 (12.4)   57.63 (12.1)
  >55 yrs              38 (40.6%)      26 (41.9%)     26 (48.1%)
  <55 yrs              37 (38.5%)      27 (43.5%)     22 (40.7%)
  Unknown              20 (20.8%)      9 (14.5%)      6 (11.1%)
T Stage
  1                    56 (58.3%)      37 (59.7%)     33 (61.1%)
  1C                   1 (1.0%)        1 (1.6%)       1 (1.9%)
  2                    37 (38.5%)      22 (35.5%)     18 (33.3%)
  Unknown              2 (2.1%)        2 (3.2%)       2 (3.7%)
Grade
  Poor                 15 (15.6%)      11 (17.7%)     9 (16.7%)
  Moderate             41 (42.7%)      31 (50.0%)     25 (46.3%)
  Good                 9 (9.4%)        6 (9.7%)       6 (11.1%)
  Unknown              31 (32.3%)      14 (22.6%)     14 (25.9%)
Distant Metastasis
  Yes                  27 (28.1%)      16 (25.8%)     14 (25.9%)
Follow Up
  Median (months)      60              70.4           66.5

TABLE 16
Distant-metastasis-free survival rates in MS high-risk and low-risk patients from University of Muenster

                           No. of      Risk     No. of      5-yr DMFS
Groups                     Patients    Group    Patients    Rate (SE)
All Patients               96          High     48          0.61 (0.072)
                                       Low      48          0.88 (0.052)
All Treated                62          High     32          0.66 (0.084)
                                       Low      30          0.89 (0.060)
Tamoxifen Treated Alone    54          High     26          0.65 (0.084)
                                       Low      28          0.88 (0.065)

TABLE 17
Clinical and pathological characteristics of 205 treated patients from Guy's Hospital

                         Node-negative ER-positive
Characteristics          (n = 205)
Menopausal Status
  Premenopausal          31 (15.12%)
  Perimenopausal         4 (1.95%)
  Postmenopausal         165 (80.49%)
  Unknown                5 (2.44%)
Age
  ≦55 yrs                74 (36.1%)
  >55 yrs                131 (63.9%)
  Mean (Std. dev.)       59.3 (10.4)
  Min.-Max.              33-86
Tumor diameter
  ≦2 cm                  138 (67.32%)
  >2 cm                  67 (32.68%)
  Mean (Std. Dev.)       1.67 (1.0)
  Min.-Max.              0-3.0
Histological Grade
  Grade 1                60 (29.27%)
  Grade 2                98 (47.8%)
  Grade 3                47 (22.93%)
Stage
  I                      138 (67.32%)
  II                     67 (32.68%)
Subtypes
  Ductal NOS             164 (80.0%)
  Lobular Classic        22 (10.73%)
  Lobular Variant        3 (1.46%)
  Tubular                8 (3.9%)
  Mucinous               6 (2.93%)
  Papillary              1 (0.49%)
  Apocrine               1 (0.49%)
Distant Recurrence
  Yes                    17 (8.29%)
  No                     188 (91.71%)
Any recurrence
  Yes                    17 (8.9%)
  No                     188 (91.71%)
Death (Any cause)
  Yes                    44 (21.46%)
  No                     161 (78.54%)
Death (Breast Cancer)
  Yes                    16 (7.8%)
  No                     189 (92.20%)
Median follow up         9.3 yrs

TABLE 18
Five-year and ten-year distant-metastasis-free survival rates in high, intermediate, and low MS groups in Guy's treated samples

                No. of                 5 yr DMFS        10 yr DMFS
                Patients    DM         (SE)             (SE)
High            40          7          0.872 (0.054)    0.804 (0.068)
Intermediate    29          2          0.966 (0.034)    0.966 (0.034)
Low             136         8          0.970 (0.015)    0.921 (0.028)
Int. + Low      165         10         0.969 (0.014)    0.928 (0.025)
All             205         17         0.950 (0.015)    0.904 (0.0239)

TABLE 19
Univariate and multivariate Cox proportional hazard regression of MS risk groups, age, tumor size, and histological grade in Guy's treated samples

                         Univariate                           Multivariate
                         Hazard Ratio  95% CI      P-value    Hazard Ratio  95% CI       P-value
MS high vs. low risk     3.25          1.24-8.54   0.017      5.82          1.71-19.75   0.0047
Age (per year)           1.03          0.98-1.08   0.27       1.03          0.98-1.08    0.31
Tumor size (per cm)      1.28          0.77-2.14   0.34       1.17          0.68-2.03    0.57
Grade 2 vs. Grade 1      3.50          0.78-15.79  0.10       2.64          0.56-12.45   0.22
Grade 3 vs. Grade 1      2.72          0.50-14.86  0.25       0.62          0.081-4.76   0.65

TABLE 20
Association of MS risk groups with age, tumor size and histological grade in Guy's treated patients

                      N       Mean    Std. Dev.
Age
  High                40      59.3    8.3
  Int. + Low          165     59.3    10.8
  One-way ANOVA p = 0.34
Tumor Size
  High                40      1.87    0.92
  Int. + Low          165     1.61    1.01
  One-way ANOVA p = 0.14

                      Grade 1   Grade 2   Grade 3
Histological grade
  High                0         9         31
  Int. + Low          60        89        16
  Cramér's V = 0.65; chi-sq test for association p-value < 0.0001

TABLE 21
Performance of MS risk groups (High vs. low risk) in subgroups of age, tumor size, histological grade and menopausal status

                      N patient   N DM    HR       L95% CI   U95% CI   P-Value
All                   205         17      3.25     1.24      8.54      0.017
Histologic Grade
  Grade 1             60          2       NA (only low score)
  Grade 2             98          11      10.72    3.00      38.36     0.0003
  Grade 3             47          4       1.53     0.16      14.66     0.715
Tumor size
  <=2 cm              138         11      3.66     1.07      12.57     0.0392
  >2 cm               67          6       2.99     0.60      14.82     0.18
Age
  <=55 yrs            74          5       7.38     1.00      54.66     0.0503
  >55 yrs             131         12      2.78     0.88      8.75      0.0813
Menopausal Status
  premenopausal       31          1       NA
  postmenopausal      165         15      2.84     1.01      7.99      0.0476

TABLE 22
Diagnostic values of MS risk groups (high vs. low risk) in Guy's treated samples

               5 yr                    10 yr
Sensitivity    0.50 (0.24-0.76)        0.44 (0.23-0.67)
Specificity    0.82 (0.76-0.87)        0.85 (0.76-0.92)
PPV            0.128 (0.068-0.227)     0.24 (0.13-0.41)
NPV            0.969 (0.944-0.983)     0.93 (0.90-0.96)

TABLE 23
Clinicopathological characteristics of 234 Japanese samples

                              Post-menopause   Pre-menopause   All
                              112 (47.9%)      122 (52.1%)     234 (100%)
Age
  <=55 yrs                    32 (28.6%)       122 (100%)      154 (65.8%)
  >55 yrs                     80 (71.4%)       0 (0%)          80 (34.2%)
  Mean (Std. dev.)            60.8 (7.8)       44.8 (6.0)      52.5 (10.6)
  Min.-Max.                   43-81            25-54           25-81
Tumor diameter
  <=2 cm                      65 (58.0%)       81 (66.4%)      146 (62.4%)
  >2 cm                       47 (42.0%)       41 (33.6%)      88 (37.6%)
  Mean (Std. Dev.)            2.15 (1.0)       1.96 (1.2)      2.05 (1.1)
  Min.-Max.                   0.3-8.4          0.1-6.5         0.1-8.4
Histological grade
  Grade 1                     28 (25%)         46 (37.7%)      74 (31.6%)
  Grade 2                     56 (50%)         57 (46.7%)      113 (48.3%)
  Grade 3                     28 (25%)         19 (15.6%)      47 (20.1%)
Tumor subtype
  II a Papillotubular         42 (37.5%)       66 (54.1%)      108 (46.2%)
  II a Scirrhous              41 (36.6%)       36 (29.5%)      77 (32.9%)
  II a Solid-tubular          24 (21.4%)       14 (11.5%)      38 (16.2%)
  II b Invasive Lobular       2 (1.8%)         2 (1.6%)        4 (1.7%)
  II b Medullary              1 (0.9%)         0 (0%)          1 (0.4%)
  II b Mucinous               2 (1.8%)         4 (3.3%)        6 (2.6%)
Stage
  I                           65 (58.0%)       81 (66.4%)      146 (62.4%)
  IIA                         44 (39.3%)       36 (29.5%)      80 (34.2%)
  IIB                         3 (2.7%)         5 (4.1%)        8 (3.4%)
  +                           65 (58%)         113 (92.6%)     178 (76.1%)
  −                           47 (42%)         9 (7.4%)        56 (23.9%)
Therapy
  Tam/Tam comb.               112 (100%)       102 (83.9%)     214 (91.5%)
  ZOL                         0 (0%)           20 (16.1%)      20 (8.5%)
Distant Metastasis
  Yes                         21 (18.8%)       10 (8.9%)       31 (13.3%)
  No                          91 (81.3%)       112 (91.1%)     203 (86.8%)
Death of any Cause
  Yes                         10 (8.9%)        9 (7.3%)        19 (8.1%)
  No                          102 (90.9%)      115 (92.7%)     215 (91.9%)
Recurrence (local and distant)
  Yes                         30 (26.8%)       16 (13.1%)      46 (19.7%)
  No                          82 (73.2%)       106 (86.9%)     188 (80.3%)
Follow-up (years)
  Median                      9                8               8.7

TABLE 24
Five-year and ten-year distant-metastasis-free survival rates in different MS groups in Japanese patients

                No. of Patients    5 yr DMFS (SE)    10 yr DMFS (SE)
High            95                 0.808 (0.042)     0.747 (0.050)
Intermediate    62                 0.965 (0.024)     0.912 (0.044)
Low             77                 0.974 (0.018)     0.887 (0.047)
Int. + Low      139                0.97 (0.015)      0.895 (0.034)
All             234                0.905 (0.020)     0.837 (0.029)

TABLE 25
Univariate and multivariate Cox proportional hazard model of time to distant metastases for MS risk groups, age, tumor size and histological grade in Japanese patients

                          Univariate                          Multivariate
                          Hazard Ratio  95% CI     P-value    Hazard Ratio  95% CI      P-value
MS high vs. int. + low    3.32          1.56-7.06  0.0018     3.79          1.42-10.1   0.0078
Age (per year)            1.04          1.00-1.08  0.032      1.03          0.99-1.07   0.087
Tissue Size (per cm)      1.45          1.14-1.83  0.0024     1.4           1.10-1.78   0.007
Hist. grade 2 vs. 1       1.47          0.60-3.61  0.40       0.72          0.24-2.14   0.56
Hist. grade 3 vs. 1       2.02          0.75-5.44  0.16       0.55          0.15-2.00   0.36

TABLE 26
Association of MS risk groups with age, tumor size and histological grade in Japanese patients

                      N       Mean    Std. Dev.
Age
  High                95      53.4    10.4
  Int. + Low          139     51.9    10.7
  One-way ANOVA p = 0.29
Tumor Size
  High                95      2.23    1.04
  Int. + Low          139     1.93    1.13
  One-way ANOVA p = 0.037

                      Grade 1   Grade 2   Grade 3
Histologic grade
  High                4         54        37
  Int. + Low          70        59        10
  Cramér's V = 0.54; chi-sq test for association p-value < 0.0001

                      2a Papillotubular   2a Scirrhous   2a Solid-tubular   2b Invasive lobular   Medullary   Mucinous
Subtype
  High                45                  23             24                 1                     1           1
  Int. + Low          63                  54             14                 3                     0           5
  Cramér's V = 0.25; chi-sq test for association p-value = 0.01

TABLE 27
Univariate and multivariate Cox proportional hazard model of time to distant metastases for MS risk groups, menopausal status, tumor size, PgR status and histological grade in Japanese patients

                           Univariate                          Multivariate
                           Hazard Ratio  95% CI     P-value    Hazard Ratio  95% CI     P-value
MS high vs. int. + low     3.32          1.56-7.06  0.0018     3.44          1.27-9.34  0.015
Pre, Tam vs. Post, Tam     0.33          0.13-0.81  0.016      0.45          0.17-1.19  0.11
Pre, ZOL vs. Post, Tam     1.22          0.42-3.58  0.72       1.8           0.55-5.81  0.33
Tissue_size (per cm)       1.45          1.14-1.83  0.0024     1.45          1.12-1.88  0.0049
PgR (−ve vs. +ve)          2.3           1.13-4.7   0.022      1.7           0.75-3.86  0.2
Hist. grade 2 vs. 1        1.47          0.6-3.61   0.4        0.53          0.17-1.62  0.26
Hist. grade 3 vs. 1        2.02          0.75-5.44  0.16       0.49          0.13-1.76  0.27

TABLE 28
Subgroup analyses: hazard ratio of MS risk groups for time to distant metastases (DMFS) in different subgroups of tumor size, age, menopausal status, histological grade and PgR status

Strata             Hazard Ratio   95% CI        P-value
ALL                3.32           1.56-7.06     0.0018
Tumor <=2 cm       4.48           1.16-17.34    0.030
Tumor >2 cm        2.27           0.92-5.62     0.077
Age <=55           4.03           1.38-11.81    0.011
Age >55            2.34           0.81-6.74     0.12
post-menopausal    2.06           0.83-5.09     0.12
pre-menopausal     6.01           1.55-23.25    0.0094
Grade 1 & 2        3.57           1.52-8.35     0.0034
Grade 3            2.35           0.29-18.77    0.42
PgR +              3.48           1.35-9.0      0.0099
PgR −              2.06           0.57-7.49     0.27

TABLE 29
Diagnostic values of MS to predict distant metastasis in 5 years for Japanese samples

               Cut 1 (int + low combined)
Sensitivity    0.81 (0.60-0.92)
Specificity    0.65 (0.58-0.71)
PPV            0.19 (0.15-0.24)
NPV            0.97 (0.93-0.99)

What is claimed is:
1. A method of determining risk associated with tumor metastasis in a breast cancer patient, comprising: (a) measuring the expression level of genes comprising CENPA, PKMYT1, MELK, MYBL2, BUB1, RACGAP1, TK1, UBE2S, DC13, RFC4, PRR11, DIAPH3, ORC6L and CCNB1 in estrogen receptor-positive tumor cells of said breast cancer patient, thereby obtaining a metastasis score (MS) based upon the expression levels of said genes; and (b) determining risk of tumor metastasis for said breast cancer patient by comparing said metastasis score to at least one predefined metastasis score cut off threshold (MS threshold).
2. The method of claim 1, wherein in step (b), said breast cancer patient is determined to have an increased risk of tumor metastasis if its MS is higher than the predefined MS threshold.
3. The method of claim 1, wherein in step (b), said breast cancer patient is determined to have a decreased risk of tumor metastasis if its MS is lower than the predefined MS threshold.
4. The method of claim 1 in which mRNA of each of said genes is reverse transcribed and amplified by the two primers associated with the corresponding gene as provided in Table 3, SEQ ID NOS:1-28.
5. The method of claim 1 in which the expression levels of said genes are normalized against the expression level of at least one housekeeping gene, or an average of two or more housekeeping genes.
6. The method of claim 5, wherein the housekeeping gene is selected from the group consisting of NUP214, PPIG and SLU7.
7. The method of claim 6 in which the mRNA of said housekeeping gene is reverse transcribed and amplified by the two primers associated with said housekeeping gene as provided in Table 3, SEQ ID NOS:29-34.
8. The method of claim 1 in which the metastasis score (MS) is calculated by the following:
$$MS = a_0 + \sum_{i=1}^{M} a_i \, G_i$$
where M=14, Gi=the expression level of each gene (i) of the fourteen said genes, a0=0.022, and ai corresponds to the value presented in Table 2 for each of said genes.
9. The method of claim 1 in which the metastasis score (MS) is calculated by the following:
$$MS = a_0 + b \left[ \sum_{i=1}^{M} a_i \, G_i \right]$$
where M=14, Gi=the standardized expression level of each gene (i) of the fourteen said genes, a0=0.022, b=−0.251 and ai corresponds to the value presented in Table 2 for each of said genes.
10. The method of claim 1 in which the metastasis score (MS) is calculated by the following:
$$MS = a_0 + b \left[ \sum_{i=1}^{M} a_i \, G_i \right]$$
where M=14, Gi=the expression level of each gene (i) of the fourteen said genes, a0=0.8657, b=−0.04778 and ai=1 for each of said genes.
11. The method of claim 1 in which the metastasis score (MS) is calculated by the following:
$$MS(\mathrm{new}) = -\frac{1}{14} \left[ \sum_{i=1}^{14} G_i \right]$$
where Gi=the expression level of each gene (i) of the fourteen said genes.
12. The method of claim 1 in which the metastasis score (MS) is calculated by adding together the expression level of each gene of the fourteen said genes.
13. A kit comprising reagents for detecting the expression levels of genes comprising CENPA, PKMYT1, MELK, MYBL2, BUB1, RACGAP1, TK1, UBE2S, DC13, RFC4, PRR11, DIAPH3, ORC6L and CCNB1; an enzyme; and a buffer.
14. The kit of claim 13, further comprising at least one reagent for detecting the expression level of at least one housekeeping gene.
15. The kit of claim 14, wherein the housekeeping gene is selected from the group consisting of NUP214, PPIG and SLU7.
16. The kit of claim 13, wherein said reagents comprise at least one primer which comprises a nucleotide sequence selected from the group consisting of SEQ ID NOS:1-28.
17. The kit of claim 15, wherein said reagent comprises at least one primer which comprises a nucleotide sequence selected from the group consisting of SEQ ID NOS:29-34.
18. A microarray comprising polynucleotides that hybridize to genes comprising CENPA, PKMYT1, MELK, MYBL2, BUB1, RACGAP1, TK1, UBE2S, DC13, RFC4, PRR11, DIAPH3, ORC6L and CCNB1.
19. The microarray of claim 18, further comprising one or more polynucleotides that each hybridize to a housekeeping gene.
20. The microarray of claim 19, wherein the housekeeping gene is selected from the group consisting of NUP214, PPIG and SLU7.